Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

Yufei Zhan, Hongyin Zhao, Yousong Zhu, Shurong Zheng, Fan Yang, Ming Tang, Jinqiao Wang

TL;DR

Griffon-R tackles the limitation of current LMMs in compositional visual reasoning by introducing a unified understand-think-answer mechanism that performs end-to-end reasoning in a single forward pass. A semi-automatic expert-supervised data engine curates 334K visual reasoning samples to train the model, aligning data with the mechanism. Empirical results show state-of-the-art or competitive performance on both compositional visual reasoning benchmarks (CLEVR, VSR) and multimodal benchmarks (MMBench, ScienceQA), demonstrating broad visual reasoning and multimodal capabilities. This approach reduces dependency on external tools and indicates a path toward more general, faithful, and efficient vision-language systems.

Abstract

Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g., grounding and visual understanding). Unlike the previous shortcut learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. In addition, we curate 334K visual instruction samples that cover both general and text-rich scenes and involve multiple foundational visual capabilities. Our trained model, Griffon-R, is capable of end-to-end automatic understanding, self-thinking, and reasoned answering. Comprehensive experiments show that Griffon-R not only achieves advanced performance on complex visual reasoning benchmarks, including VSR and CLEVR, but also improves multimodal capabilities across benchmarks such as MMBench and ScienceQA. Data, models, and code will be released at https://github.com/jefferyZhan/Griffon/tree/master/Griffon-R soon.

Paper Structure

This paper contains 27 sections, 3 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Enabled by the proposed unified mechanism, Griffon-R naturally connects the reasoning processes for locating each balloon with answering the spatial relationship question. It effectively analyzes their y-axis coordinates and provides the correct answer in a single pass.
  • Figure 2: Detailed illustration of the unified visual reasoning mechanism with the "Understand-Think-Answer" process. Key information related to the answer is highlighted or visualized in green. Details of the designed process are shown in bold; this bold text is neither generated by the model nor used in training.
  • Figure 3: Illustration of the semi-automatic expert-supervised data generation engine.
  • Figure 4: Ablation on understanding quality. Because the REC task covers both object localization and attribute perception, we use it to evaluate the quality of understanding in the mechanism.
  • Figure 5: Visualization of Griffon-R's reasoning results. Correct answers are highlighted in bold green, and the relevant information within the long text leading to the answer is bolded.
  • ...and 1 more figure
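
The Figure 1 caption describes answering a spatial relationship question by comparing the y-axis coordinates of the located objects. A minimal sketch of that comparison step is given below. It is only an illustration, not Griffon-R's implementation: the model performs this reasoning in generated text within a single pass, whereas the `Box` type, the `higher_object` helper, and the coordinates here are hypothetical. It also assumes the usual image convention that y increases downward.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical grounded object: a label plus box corners in pixels."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def center_y(self) -> float:
        # Vertical center of the box.
        return (self.y1 + self.y2) / 2.0


def higher_object(a: Box, b: Box) -> str:
    """Return the label of the box that appears higher in the image.

    Assumes y grows downward, so the smaller center_y is higher.
    Ties are resolved in favor of the second box.
    """
    return a.label if a.center_y < b.center_y else b.label


# "Understand": boxes that a grounding step might produce (made-up values).
red = Box("red balloon", 10, 20, 50, 80)
blue = Box("blue balloon", 60, 120, 100, 200)

# "Think" and "Answer": compare the vertical centers and report the result.
answer = higher_object(red, blue)
print(f"The {answer} is higher.")
```

Here the red balloon's vertical center (50) is smaller than the blue balloon's (160), so the red balloon is reported as higher.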