OpenAI's o1 model showed that scaling inference time (using more compute during inference) can significantly improve a language model's reasoning capabilities. LLaVA-o1, a new model developed by researchers from several Chinese universities, brings this paradigm to open-source vision language models (VLMs).
Early open-source VLMs typically use a direct prediction approach, generating answers without reasoning about the prompt and the steps required to solve it. Without a structured reasoning process, they are less effective at tasks that require logical reasoning. Advanced prompting techniques such as chain-of-thought (CoT) prompting, where the model is encouraged to generate intermediate reasoning steps, produce only marginal improvements, and VLMs still often make errors or hallucinate.
The researchers observed that a key problem is that the reasoning process in existing VLMs is not sufficiently systematic and structured. The models do not generate structured reasoning chains and often get stuck in reasoning processes where they do not know what stage they are at or which specific problem they need to solve.
“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from logical reasoning toward conclusions, instead presenting a conclusion prematurely and then attempting to justify it. Because language models generate responses token by token, once an erroneous conclusion is introduced, the model typically continues down a flawed reasoning path.”
Multi-step reasoning
OpenAI o1 uses inference-time scaling to address the lack of systematic, structured reasoning, allowing the model to pause and review its results as it progressively solves the problem. Although OpenAI has not published many details about o1's underlying mechanism, its results show promising avenues for improving the reasoning capabilities of foundation models.
Inspired by o1, researchers designed LLaVA-o1 to perform step-by-step reasoning. Instead of generating a direct reasoning chain, LLaVA-o1 breaks down the reasoning process into four distinct steps:
Summary: The model first provides a high-level summary of the question, describing the main problem it needs to answer.
Caption: If an image is present, the model describes the relevant parts, focusing on elements related to the question.
Reasoning: Based on the summary, the model performs structured and logical reasoning to obtain a preliminary answer.
Conclusion: Finally, the model presents a concise summary of the answer based on the previous reasoning.
Only the conclusion step is visible to the user; the other three steps represent the internal reasoning process of the model, similar to the hidden reasoning trace of o1. This structured approach allows LLaVA-o1 to manage its reasoning process independently, leading to better performance on complex tasks.
“This structured approach allows the model to independently manage its reasoning process, thereby improving its adaptability and performance on complex reasoning tasks,” the researchers write.
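For illustration, here is a minimal Python sketch of how such a structured response might be parsed so that only the conclusion is surfaced to the user. The tag names and the parse_stages and user_visible_answer helpers are assumptions made for this example, not the paper's exact output format.

```python
import re

# Illustrative only: the stage tag names below are an assumed delimiter scheme
# for a four-stage response, not necessarily the format used by LLaVA-o1.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(response: str) -> dict:
    """Split a structured model response into its four reasoning stages."""
    stages = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        stages[stage] = match.group(1).strip() if match else ""
    return stages

def user_visible_answer(response: str) -> str:
    """Only the conclusion stage is shown to the user; the rest stays internal."""
    return parse_stages(response)["CONCLUSION"]
```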
LLaVA-o1 also introduces a novel inference-time scaling technique called “stage-level beam search.” Stage-level beam search generates multiple candidate outputs at each reasoning stage, then selects the best candidate at each stage to continue the generation process. This contrasts with the classic best-of-N approach, in which the model generates several complete answers before one is selected.
“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference-time scaling.”
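To make the contrast concrete, here is a rough Python sketch of stage-level beam search versus best-of-N. The generate_stage, generate_full_answer, and score helpers are hypothetical placeholders, and the actual selection mechanism in LLaVA-o1 may differ in its details.

```python
# Minimal sketch, assuming hypothetical helpers:
#   generate_stage(context, stage) -> str
#   generate_full_answer(question, image) -> str
#   score(context, candidate) -> float
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def stage_level_beam_search(question, image, generate_stage, score, beam_size=2):
    context = {"question": question, "image": image, "stages": []}
    for stage in STAGES:
        # Sample several candidates for this stage only ...
        candidates = [generate_stage(context, stage) for _ in range(beam_size)]
        # ... then keep the best one before moving on to the next stage.
        best = max(candidates, key=lambda c: score(context, c))
        context["stages"].append((stage, best))
    return context["stages"][-1][1]  # the conclusion stage

def best_of_n(question, image, generate_full_answer, score, n=4):
    # Classic best-of-N for contrast: generate complete answers end to end,
    # then pick the highest-scoring one.
    answers = [generate_full_answer(question, image) for _ in range(n)]
    return max(answers, key=lambda a: score({"question": question}, a))
```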
LLaVA-o1 training
To train LLaVA-o1, the researchers compiled a new dataset of approximately 100,000 image-question-answer pairs drawn from several widely used VQA datasets. The dataset covers a variety of tasks, from multi-turn question answering to chart interpretation and geometric reasoning.
The researchers used GPT-4o to generate a detailed four-step reasoning process for each example, including the summary, caption, reasoning, and conclusion steps.
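As a rough illustration, a pipeline along these lines could query GPT-4o through the OpenAI API as sketched below. The prompt wording and the annotate helper are hypothetical, not the authors' published data-generation pipeline.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prompt: the exact instructions used by the authors are not public.
ANNOTATION_PROMPT = (
    "Answer the question about the image in four clearly labeled stages: "
    "SUMMARY (restate the problem), CAPTION (describe the relevant image "
    "content), REASONING (step-by-step logic), and CONCLUSION (final answer)."
)

def annotate(question: str, image_url: str, reference_answer: str) -> str:
    """Ask GPT-4o for a four-stage reasoning trace for one VQA example."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{ANNOTATION_PROMPT}\n\nQuestion: {question}\n"
                         f"Reference answer: {reference_answer}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```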
The researchers then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. They have not released the model but plan to release the dataset, called LLaVA-o1-100k.
LLaVA-o1 in action
The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Despite being trained on only 100,000 examples, LLaVA-o1 showed significant performance improvements over the base Llama model, with an average benchmark score increase of 6.9%.
Additionally, stage-level beam search led to further performance gains, demonstrating the effectiveness of inference-time scaling. Due to compute constraints, the researchers were only able to test the technique with a beam size of 2. They expect even greater improvements with larger beam sizes.
Impressively, LLaVA-o1 outperformed not only other open-source models of the same size or larger, but also some closed-source models such as GPT-4o-mini and Gemini 1.5 Pro.
“LLaVA-o1 sets a new standard for multimodal reasoning in VLMs, providing robust performance and scalability, particularly in inference-time scaling,” the researchers write. “Our work paves the way for future research on structured reasoning in VLMs, including potential extensions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.”