Wow, this is pretty good. My sort of benchmark is a photo of someone holding my sweet little Ollie. Well, he's not so little and he's not mine anymore, but he'll always be my widdle Ollie!
Anyways, GPT-4 Vision wasn't always able to tell me that he doesn't really look the most comfortable being held, because that's a lot of gravity pulling him down. Neither was LLaVA in the past. But LLaVA 1.6 34B can, with no further prompting than "Please describe this image" as the first user message along with the image. So yeah, this is really amazing! Its OCR has also definitely improved: before, it'd just say the text is in another language, but now it actually reads the text out. Can't wait to have a good enough computer to quickly run this locally.
I can't imagine how much of a difference it would make when vision models like this can be run at the edge in real time for blind users, including proper triage of which scene information is relevant enough to narrate.
Wild to think about the long tail of accessibility and how different it's going to look with these improvements to generative AI.
>LLaVA-1.6 is trained with 32 GPUs for ~1 day, with 1.3M data samples in total. The compute / training data cost is 100-1000 times smaller than others.
Agreed. If the resource usage can be optimized further, it'll be feasible to train specialist models both on-prem and from scratch. That would sidestep the liability and privacy issues of current cloud-based offerings.
Google's search appliances did the same thing before they were retired. Hospitals were especially keen on them. They eliminated HIPAA risks because data never left the hospital intranet, but then Google eliminated the product line in favor of cloud offerings.
My belief, which may be false and not informed by direct experience, is that the search appliances didn't really work that well since the hyperlink based search ranking method of the day doesn't really work on relatively small data holdings of an individual organization. Can anyone confirm if that was the case?
I used it in, what, 2004? 2005? It was pretty good, especially compared to today's typical corporate on-prem search options, e.g. Sharepoint or Confluence, both of which are almost certainly inferior to a physical filing cabinet.
TL;DR: It seemed to work for organizations which fit the appliance's assumptions
> hyperlink based search ranking method of the day doesn't really work on relatively small data holdings of an individual organization
In my experience, it depends on what you mean by relatively small. Cloud editing and permissions aside, there seemed to be 3 key factors:
1. Structure: hub-and-spoke sites interlinking crucial documents and pages
2. Culture: interdepartmental and personal rivalries which demand links as credit
3. Size: lots of employees creating and maintaining sites and content
I was too young to understand it then, but all of these help an intranet's content resemble the web of the 90s and 2000s. That's the era Google's algorithm was designed for.
The results weren't always perfect, but they were nearly always better than the alternatives. In-house or pre-existing alternatives tended to be awful, and sometimes there was no search at all.
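The intuition above — that a hub-and-spoke intranet with lots of interlinking fits PageRank's assumptions — can be sketched with a toy power-iteration PageRank. The graph and node names here are made up for illustration:

```python
# Toy power-iteration PageRank on a hub-and-spoke intranet:
# a hub page links to three department pages, each of which
# links back to the hub (e.g. for "credit", as described above).

def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Every page gets a baseline (1-d)/N, plus shares of the
        # rank of pages linking to it.
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src, outs in links.items():
            share = d * rank[src] / len(outs)
            for dst in outs:
                new[dst] += share
        rank = new
    return rank

links = {
    "hub": ["dept_a", "dept_b", "dept_c"],
    "dept_a": ["hub"],
    "dept_b": ["hub"],
    "dept_c": ["hub"],
}
ranks = pagerank(links)
# The hub accumulates the most rank because every spoke links back to it.
```

With only a handful of pages and no links (the "small data holdings" case the grandparent describes), every page gets essentially the same score and the ranking signal disappears — which is exactly why structure and linking culture mattered.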
Keep in mind, all of this was ~10 years ago. My memory isn't perfect and a lot of things have changed since then.
I even made sure some of them did. It was my job at a few, but the relationships and boundaries between these were complicated. For convenience, let's say n_observed roughly equals 4.
Good point on structuring internal knowledge holdings to enable PageRank. The organizations I've worked in seem to put zero thought into this and instead just leave it up to individuals.
TL;DR: There was no intentional structuring, just Conway's law[1] in action
I think you misunderstood: the organizations where PageRank worked best also put zero thought into it. I think it's more like holdovers from 90s and 2000s management trends emergently created an intranet structure which fit the assumptions underlying PageRank. Even interdepartmental tensions may have helped due to imitating the company behaviors of those eras.
Damn, literally a day after I wrote up my experiments[0] with LLaVA 1.5 and computing image embeddings. Interesting to see the performance with the fine-tuned Mistral-7B variant being pretty close to the one with Vicuna-13B - using Mistral 7B is what BakLLaVA did back with LLaVA 1.5.
Anyone got any fun stories to share of trying to use LLaVA e.g. to make toy robots navigate? How good is it at outputting directions in structured data, guess distances, angles, etc.?
My weekend hacking goal would be something like an RC car that can "drive to the largest plant in the room" or "go hide under the dining table" when prompted by voice. Slowly, by combining some sort of basic SLAM with still image prompting.
Of course an alternative to doing it one-shot would be to collect lots of pictures + orientation for each, have LLaVA only caption them, then prompt a more generic LLM with that collected world info to pick where to go, etc.
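For the structured-output part, here's a minimal sketch of what the receiving side might look like, assuming you prompt the model to answer in a small JSON schema. The schema and field names (`action`, `heading_deg`, `distance_m`) are my invention, not anything LLaVA is guaranteed to emit:

```python
# Hypothetical sketch: validate a structured navigation command that a
# vision-language model was prompted to emit as JSON, clamping numeric
# fields so a hallucinated value can't send the car across the room.
import json

def parse_nav_command(reply: str):
    cmd = json.loads(reply)
    action = cmd.get("action")
    if action not in {"drive", "turn", "stop"}:
        raise ValueError(f"unknown action: {action!r}")
    heading = max(-180.0, min(180.0, float(cmd.get("heading_deg", 0))))
    distance = max(0.0, min(2.0, float(cmd.get("distance_m", 0))))
    return action, heading, distance

# Example reply a model *might* produce for "drive to the largest plant":
reply = '{"action": "drive", "heading_deg": 30, "distance_m": 1.2}'
```

In practice you'd also want a retry path for replies that aren't valid JSON at all, since smaller VLMs wander off-format fairly often.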
What I like most about this AI stuff is how many neat things it makes achievable in a weekend by a motivated hobbyist that in the past required entire companies to tackle :). DIY/maker life in the AI age has been amazing fun so far.
Apologies, this was the wrong image upload, and it's now too late to edit the post. The intended one was a screenshot of a LLaVA 1.6 demo conversation about it:
My best guess is that you want a supervisor GPT-4-like LLM planning the task, and a lower-level on-prem model doing subtasks like driving from one location to another or grasping an item.
Sending every frame to GPT-4 is way too slow right now. But a Tesla-FSD-like model can already drive from one location to another in a closed environment near-perfectly. All that's missing is training that style of model in a Roomba/robot form, and then having GPT-4 monitor and manage the tasks at a 10- or 20-second interval.
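The two-tier idea reduces to a simple control loop: a fast local policy runs every tick, and the slow supervisor is consulted only every N ticks. Both policies below are stubs I made up; the interval arithmetic is the point:

```python
# Minimal sketch of a two-tier control loop: a fast on-device policy
# runs every tick, while a slow "supervisor" (e.g. a big remote LLM
# looking at a camera frame) is consulted only every `supervisor_every`
# ticks and hands down the current goal.

def run(ticks, supervisor_every, local_step, supervise):
    goal = None
    calls = 0
    for t in range(ticks):
        if t % supervisor_every == 0:
            goal = supervise(t)   # slow: remote planning call
            calls += 1
        local_step(t, goal)       # fast: local low-level control
    return calls

# At 10 Hz control with a supervisor consulted every 100 ticks (~10 s),
# a 60-second run makes 6 supervisor calls instead of 600.
```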
Oh, I'm OK with going slow! It doesn't have to be all that practical, I'm more curious about playing with toy approaches. Trying to populate a world model with a few captioned still frames plus basic IMU/dead reckoning seems like a fun challenge ...
Reminds me of the recent HN post on jumping spider intelligence: They can do complex route planning, but need to stare for hours before they get going. This is probably more down to their tiny field of view on the front-facing good eyes, but still :-)
My current test of image models is generating React code from a screenshot of the top of a HackerNews comment page. Llava-1.6 gave me this (over two responses), which is honestly not bad:
```ts
const CommentForm = () => {
  // State to hold the user's input
  const [comment, setComment] = useState('');
  // State to hold the list of comments
  const [comments, setComments] = useState([]);

  // Function for posting a new comment
  const postComment = (e) => {
    e.preventDefault();
    // Add logic here to handle the POST request and update state
    setComments([...comments, { content: comment }]); // Assuming you want the entire object in your state
    setComment(''); // Reset the input field after posting
  };
  ...
```
```ts
import React from 'react';
import { CommentForm } from './CommentForm';
```
Wow! You folks are making huge strides for open-source multimodal models. Thank you for all the time and effort on these, as they will open up many opportunities for researchers and developers. Also, the emergent zero-shot capabilities when LLaVA-1.6 is tested against Chinese benchmarks with only English multimodal training data are interesting, and may be a good direction for future research.
My main interest with VLMs is their ability to caption images, and this one honestly seems very good; it's going to be super useful for captioning datasets.
This thought just occurred to me: would it make sense to train a model to recognise vector-encoded video frames?
I've mostly forgotten how video encoders work, but I do remember that some encode the differences between one frame and the next as vectors describing the "motion" of pixel blocks. The training data would be comprised of frames from a video, with labels assigned to time ranges across the length of the video.
Then we could feed a video stream to a model and it would learn to recognise not only still images, but also motion across time.
Make it fast enough and we would have near-real-time inference of video.
If this works, maybe an extension to this model would be to accept its previous inference result as an input to the next frame's inference request. Then we'd have results like "a person entered the bright light of a sunny day in the countryside".
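For concreteness, the "motion vectors" codecs produce come from block matching between consecutive frames. This is a toy pure-Python version of that idea (real encoders work on DCT blocks with much fancier search, and the frame here is a tiny synthetic array):

```python
# Toy block-matching motion estimation, in the spirit of video-codec
# motion compensation. Frames are 2D lists of ints; this illustrates
# the kind of "motion" feature the comment proposes feeding to a model.

def sad(prev, curr, by, bx, dy, dx, bs):
    """Sum of absolute differences between the block at (by, bx) in
    curr and the block at (by+dy, bx+dx) in prev."""
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(curr[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total

def motion_vector(prev, curr, by, bx, bs=4, search=2):
    """Return the (dy, dx) pointing from the block in curr back to its
    best-matching source position in prev."""
    h, w = len(prev), len(prev[0])
    candidates = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if 0 <= by + dy and by + dy + bs <= h and 0 <= bx + dx and bx + dx + bs <= w:
                candidates.append((sad(prev, curr, by, bx, dy, dx, bs), dy, dx))
    _, dy, dx = min(candidates)
    return dy, dx

# A bright square that moves one pixel to the right between frames:
prev = [[0] * 12 for _ in range(12)]
curr = [[0] * 12 for _ in range(12)]
for y in range(4, 8):
    for x in range(4, 8):
        prev[y][x] = 9
        curr[y][x + 1] = 9
```

Feeding a grid of such vectors (instead of, or alongside, raw pixels) to a model is essentially what the comment is suggesting; the vectors are already computed by the decoder, which is where the speed win would come from.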
I tested GPT-4V and LLaVA 1.6 on Chinese text and they both hallucinate like crazy. LVMs can still barely recognize characters, while traditional OCR nails it. Does anyone know why?
Almost definitely just that it hasn't been trained on such data, rather than the task being inherently difficult. I tried Llava 1.6 on an image with Swedish text and it parsed a large and clear Ä as A while the other letters were mostly correct.
One potential use case I've had in mind and never gotten around to building is using them to tag and sort a huge library of images in detail; OCR is useless here.
The idea would be to have a more semantic type of search based on the content of the image or its art style.
So far in my tests GPT-4V seems to perform better, though it's very heavy on the censorship guardrails.
This 1.6 performs better than the previous version and seems to hallucinate a bit less.
One use is that these models can do OCR in the wild, e.g. reading text from a sign on a window in a photo. I think traditional OCR libraries are more focused on reading printed pages.
What makes you certain there is no reasoning involved here? Is it lack of "intent"? Does the user's prompt not provide sufficient intent to the LLM?
Based on the demo linked in the article, you can specifically prompt "What is unusual about this image? Walk me through your reasoning step by step" and get a thorough understanding of the reasoning behind the LLM's response.
So, yes, words do have meaning, and the word "reasoning" appears apt.
It's not perfect, but it can reason, i.e. solve many more situations than it was trained on.
Here is a paper demonstrating that GPT-4 can combine up to 5 skills from a set of 100, effectively covering 100^5 tuples of skills, while having seen far fewer combinations during training.
> simple probability calculations indicate that GPT-4's reasonable performance on k=5 is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training