Wow, this is pretty good. My sort of benchmark is a photo of someone holding my sweet little Ollie. Well, he's not so little and he's not mine anymore, but he'll always be my widdle Ollie!
Anyways, GPT-4 Vision wasn't always able to tell me that he doesn't really look the most comfortable being held, because that's a lot of gravity pulling him down. Neither was LLaVA in the past. But LLaVA 1.6 34B can, with no further prompting than "Please describe this image" as the first user message along with the image. So yeah, this is really amazing! Its OCR has also definitely improved: before, it'd just say the text is in another language, but now it actually reads the text out. Can't wait to have a good enough computer to quickly run this locally.
I can't imagine how much of a difference it would make when vision models like this can be run at the edge in real time for blind users, including proper triage of which scene information is relevant enough to narrate.
Wild to think about the long tail of accessibility and how different it's going to look with these improvements to generative AI.
>LLaVA-1.6 is trained with 32 GPUs for ~1 day, with 1.3M data samples in total. The compute / training data cost is 100-1000 times smaller than others.
Agreed. If the resource usage can be optimized further, it'll be feasible to train specialist models both on-prem and from scratch. That would sidestep the liability and privacy issues of current cloud-based offerings.
Google's search appliances did the same thing before they were retired. Hospitals were especially keen on them. They eliminated HIPAA risks because data never left the hospital intranet, but then Google eliminated the product line in favor of cloud offerings.
My belief, which may be false and not informed by direct experience, is that the search appliances didn't really work that well since the hyperlink based search ranking method of the day doesn't really work on relatively small data holdings of an individual organization. Can anyone confirm if that was the case?
I used it in, what, 2004? 2005? It was pretty good, especially compared to today's typical corporate on-prem search options, e.g. Sharepoint or Confluence, both of which are almost certainly inferior to a physical filing cabinet.
TL;DR: It seemed to work for organizations which fit the appliance's assumptions
> hyperlink based search ranking method of the day doesn't really work on relatively small data holdings of an individual organization
In my experience, it depends on what you mean by relatively small. Cloud editing and permissions aside, there seemed to be 3 key factors:
1. Structure: hub-and-spoke sites interlinking crucial documents and pages
2. Culture: interdepartmental and personal rivalries which demand links as credit
3. Size: lots of employees creating and maintaining sites and content
I was too young to understand it then, but all of these help an intranet's content resemble the web of the 90s and 2000s. That's the era Google's algorithm was designed for.
The results weren't always perfect, but they were nearly always better than the alternatives. In-house or pre-existing alternatives tended to be awful, and sometimes there was no search at all.
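The intuition above — that a hub-and-spoke intranet with lots of interlinking fits PageRank's assumptions — can be sketched with a toy power-iteration PageRank. The graph and node names here are made up for illustration:

```python
# Toy power-iteration PageRank on a hub-and-spoke intranet:
# a hub page links to three department pages, each of which
# links back to the hub (e.g. for "credit", as described above).

def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Every page gets a baseline (1-d)/N, plus shares of the
        # rank of pages linking to it.
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src, outs in links.items():
            share = d * rank[src] / len(outs)
            for dst in outs:
                new[dst] += share
        rank = new
    return rank

links = {
    "hub": ["dept_a", "dept_b", "dept_c"],
    "dept_a": ["hub"],
    "dept_b": ["hub"],
    "dept_c": ["hub"],
}
ranks = pagerank(links)
# The hub accumulates the most rank because every spoke links back to it.
```

With only a handful of pages and no links (the "small data holdings" case the grandparent describes), every page gets essentially the same score and the ranking signal disappears — which is exactly why structure and linking culture mattered.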
Keep in mind, all of this was ~10 years ago. My memory isn't perfect and a lot of things have changed since then.
I even made sure some of them did. It was my job at a few, but the relationships and boundaries between these were complicated. For convenience, let's say n_observed roughly equals 4.
Good point on structuring internal knowledge holdings to enable PageRank. The organizations I've worked in seem to put zero thought into this and instead just leave it up to individuals.
TL;DR: There was no intentional structuring, just Conway's law[1] in action
I think you misunderstood: the organizations where PageRank worked best also put zero thought into it. I think it's more like holdovers from 90s and 2000s management trends emergently created an intranet structure which fit the assumptions underlying PageRank. Even interdepartmental tensions may have helped due to imitating the company behaviors of those eras.
Damn, literally a day after I wrote up my experiments[0] with LLaVA 1.5 and computing image embeddings. Interesting to see the performance with the fine-tuned Mistral-7B variant being pretty close to the one with Vicuna-13B - using Mistral 7B is what BakLLaVA did back with LLaVA 1.5.
Anyone got any fun stories to share of trying to use LLaVA e.g. to make toy robots navigate? How good is it at outputting directions in structured data, guess distances, angles, etc.?
My weekend hacking goal would be something like an RC car that can "drive to the largest plant in the room" or "go hide under the dining table" when prompted by voice. Slowly, by combining some sort of basic SLAM with still image prompting.
Of course an alternative to doing it one-shot would be to collect lots of pictures + orientation for each, have LLaVA only caption them, then prompt a more generic LLM with that collected world info to pick where to go, etc.
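For the structured-output part, here's a minimal sketch of what the receiving side might look like, assuming you prompt the model to answer in a small JSON schema. The schema and field names (`action`, `heading_deg`, `distance_m`) are my invention, not anything LLaVA is guaranteed to emit:

```python
# Hypothetical sketch: validate a structured navigation command that a
# vision-language model was prompted to emit as JSON, clamping numeric
# fields so a hallucinated value can't send the car across the room.
import json

def parse_nav_command(reply: str):
    cmd = json.loads(reply)
    action = cmd.get("action")
    if action not in {"drive", "turn", "stop"}:
        raise ValueError(f"unknown action: {action!r}")
    heading = max(-180.0, min(180.0, float(cmd.get("heading_deg", 0))))
    distance = max(0.0, min(2.0, float(cmd.get("distance_m", 0))))
    return action, heading, distance

# Example reply a model *might* produce for "drive to the largest plant":
reply = '{"action": "drive", "heading_deg": 30, "distance_m": 1.2}'
```

In practice you'd also want a retry path for replies that aren't valid JSON at all, since smaller VLMs wander off-format fairly often.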
What I like most about this AI stuff is how many neat things it makes achievable in a weekend by a motivated hobbyist that in the past required entire companies to tackle :). DIY/maker life in the AI age has been amazing fun so far.
Apologies, this was the wrong image upload, and it's now too late to edit the post. The intended one was a screenshot of a LLaVA 1.6 demo conversation about it:
My best guess is that you want a supervisor GPT-4-like LLM planning the task, and a lower-level on-prem model doing subtasks like driving from one location to another or grasping an item.
Sending every frame to GPT-4 is way too slow right now. But a Tesla-FSD-like model can already drive from one location to another in a closed environment near-perfectly. All that's missing is training that style of model in a Roomba/robot form, and then having GPT-4 monitor and manage the tasks at a 10- or 20-second interval.
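The two-tier idea reduces to a simple control loop: a fast local policy runs every tick, and the slow supervisor is consulted only every N ticks. Both policies below are stubs I made up; the interval arithmetic is the point:

```python
# Minimal sketch of a two-tier control loop: a fast on-device policy
# runs every tick, while a slow "supervisor" (e.g. a big remote LLM
# looking at a camera frame) is consulted only every `supervisor_every`
# ticks and hands down the current goal.

def run(ticks, supervisor_every, local_step, supervise):
    goal = None
    calls = 0
    for t in range(ticks):
        if t % supervisor_every == 0:
            goal = supervise(t)   # slow: remote planning call
            calls += 1
        local_step(t, goal)       # fast: local low-level control
    return calls

# At 10 Hz control with a supervisor consulted every 100 ticks (~10 s),
# a 60-second run makes 6 supervisor calls instead of 600.
```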
Oh, I'm OK with going slow! It doesn't have to be all that practical, I'm more curious about playing with toy approaches. Trying to populate a world model with a few captioned still frames plus basic IMU/dead reckoning seems like a fun challenge ...
Reminds me of the recent HN post on jumping spider intelligence: They can do complex route planning, but need to stare for hours before they get going. This is probably more down to their tiny field of view on the front-facing good eyes, but still :-)
My current test of image models is generating React code from a screenshot of the top of a HackerNews comment page. Llava-1.6 gave me this (over two responses), which is honestly not bad:
```ts
const CommentForm = () => {
  // State to hold the user's input
  const [comment, setComment] = useState('');
  // State to hold the list of comments
  const [comments, setComments] = useState([]);

  // Function for posting a new comment
  const postComment = (e) => {
    e.preventDefault();
    // Add logic here to handle the POST request and update state
    setComments([...comments, { content: comment }]); // Assuming you want the entire object in your state
    setComment(''); // Reset the input field after posting
  };
  ...
```
```ts
import React from 'react';
import { CommentForm } from './CommentForm';
```
Wow! You folks are making huge strides for open-source multimodal models. Thank you for all the time and effort on these, as they will open up many opportunities for researchers and developers. Also, the emergent zero-shot capabilities when LLaVA-1.6 is tested against Chinese benchmarks with only English multimodal training data are interesting, and may be a good direction for future research.
My main interest with VLMs is their ability to caption images, and this one honestly seems very good; it's going to be super useful for captioning datasets.
This thought just occurred to me: would it make sense to train a model to recognise vector-encoded video frames?
I've mostly forgotten how video encoders work, but I do remember that some encode the differences between one frame and the next as vectors describing the "motion" of pixel blocks. The training data would be comprised of frames from a video, with labels assigned to time ranges across the length of the video.
Then we could feed a video stream to a model and it would learn to recognise not only still images, but also motion across time.
Make it fast enough and we would have near-real-time inference of video.
If this works, maybe an extension to this model would be to accept its previous inference result as an input to the next frame's inference request. Then we'd have results like "a person entered the bright light of a sunny day in the countryside".
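For concreteness, the "motion vectors" codecs produce come from block matching between consecutive frames. This is a toy pure-Python version of that idea (real encoders work on DCT blocks with much fancier search, and the frame here is a tiny synthetic array):

```python
# Toy block-matching motion estimation, in the spirit of video-codec
# motion compensation. Frames are 2D lists of ints; this illustrates
# the kind of "motion" feature the comment proposes feeding to a model.

def sad(prev, curr, by, bx, dy, dx, bs):
    """Sum of absolute differences between the block at (by, bx) in
    curr and the block at (by+dy, bx+dx) in prev."""
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(curr[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total

def motion_vector(prev, curr, by, bx, bs=4, search=2):
    """Return the (dy, dx) pointing from the block in curr back to its
    best-matching source position in prev."""
    h, w = len(prev), len(prev[0])
    candidates = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if 0 <= by + dy and by + dy + bs <= h and 0 <= bx + dx and bx + dx + bs <= w:
                candidates.append((sad(prev, curr, by, bx, dy, dx, bs), dy, dx))
    _, dy, dx = min(candidates)
    return dy, dx

# A bright square that moves one pixel to the right between frames:
prev = [[0] * 12 for _ in range(12)]
curr = [[0] * 12 for _ in range(12)]
for y in range(4, 8):
    for x in range(4, 8):
        prev[y][x] = 9
        curr[y][x + 1] = 9
```

Feeding a grid of such vectors (instead of, or alongside, raw pixels) to a model is essentially what the comment is suggesting; the vectors are already computed by the decoder, which is where the speed win would come from.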
I tested GPT-4V and LLaVA 1.6 on Chinese text and they both hallucinate like crazy. LVMs can still barely recognize characters, while traditional OCR nails it. Does anyone know why?
Almost definitely just that it hasn't been trained on such data, rather than the task being inherently difficult. I tried Llava 1.6 on an image with Swedish text and it parsed a large and clear Ä as A while the other letters were mostly correct.
One potential use case I've had in mind and never gotten around to building is using them to tag and sort a huge library of images in detail; OCR is useless here.
The idea would be to have a more semantic type of search based on the content of the image or its art style.
So far in my tests GPT-4V seems to perform better, though it's very heavy on the censorship guardrails.
This 1.6 performs better than the previous version and seems to hallucinate a bit less.
One use is that these models can do OCR in the wild, e.g. reading text from a sign on a window in a photo. I think traditional OCR libraries are more focused on reading printed pages.
What makes you certain there is no reasoning involved here? Is it lack of "intent"? Does the user's prompt not provide sufficient intent to the LLM?
Based on the demo linked in the article, you can specifically prompt "What is unusual about this image? Walk me through your reasoning step by step" and get a thorough understanding of the reasoning behind the LLM's response.
So, yes, words do have meaning, and the word "reasoning" appears apt.
It's not perfect, but it can reason, i.e. solve many more situations than it was trained on.
Here is a paper demonstrating that GPT-4 can combine up to 5 skills from a set of 100, effectively covering 100^5 tuples of skills, while having seen far fewer combinations during training.
> simple probability calculations indicate that GPT-4's reasonable performance on k=5 is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training