From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

TL;DR

A least-to-most visual reasoning paradigm is introduced, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub-questions, and a novel data synthesis approach is proposed that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner.

Abstract

We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging because reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools to resolve the sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks and relies (almost entirely) on open-source models to accomplish them. Therefore, the entire synthesis process is reproducible and cost-efficient, and the quality of the synthesized data is guaranteed. With the approach, we construct 50k visual reasoning examples. Then, we develop a visual reasoner through supervised fine-tuning, which is capable of generally enhancing the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments indicate that the visual reasoner can consistently and significantly improve four VLMs on four VQA benchmarks. Our code and dataset are available at https://github.com/steven-ccq/VisualReasoner.
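The interleaving of decomposition and tool invocation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The tool names Grounding, OCR, and Answer come from the Figure 5 caption. Everything else is a hypothetical stand-in: the `propose` callback that plays the role of the reasoner, the stubbed tools, the `max_steps` cap, and the rule that the Answer tool ends the loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces, for illustration only.
Tool = Callable[[str, str], str]           # (image, argument) -> result text
Proposal = Tuple[str, str, str]            # (sub_question, tool_name, argument)


@dataclass
class Step:
    sub_question: str
    tool: str
    argument: str
    result: str


def least_to_most(
    image: str,
    question: str,
    propose: Callable[[str, List[Step]], Proposal],
    tools: Dict[str, Tool],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Alternate between proposing a sub-question and calling a tool.

    Assumption: the loop stops once the "Answer" tool has been invoked.
    """
    history: List[Step] = []
    for _ in range(max_steps):
        # Decompose: the reasoner picks the next sub-question and a tool.
        sub_question, tool_name, argument = propose(question, history)
        # Resolve: the chosen external tool answers the sub-question.
        result = tools[tool_name](image, argument)
        history.append(Step(sub_question, tool_name, argument, result))
        if tool_name == "Answer":
            return result, history
    return None, history  # step budget exhausted without a final answer


# Toy run with stubbed tools and a scripted reasoner (no real models).
def scripted_reasoner(question: str, history: List[Step]) -> Proposal:
    plan = [
        ("Where is the donut?", "Grounding", "donut"),
        ("What text appears in that region?", "OCR", "donut region"),
        (question, "Answer", question),
    ]
    return plan[len(history)]


toy_tools: Dict[str, Tool] = {
    "Grounding": lambda image, arg: f"box({arg})",
    "OCR": lambda image, arg: "stub text",
    "Answer": lambda image, arg: "stub answer",
}

answer, trace = least_to_most(
    "image.jpg", "What does the donut say?", scripted_reasoner, toy_tools
)
```

In this sketch the plug-and-play aspect would correspond to swapping the stubbed Answer tool for an existing VLM while leaving the loop unchanged; that mapping is an interpretation of the abstract rather than a detail it states.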

Paper Structure

This paper contains 38 sections, 4 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Top Left: An example from TextVQA [singh2019towards] with "2010" as the ground truth. Middle Left: Response from LLaVA-NeXT-13B [liu2024llavanext]. Bottom Left: Response from GPT-4o with the prompt "{question} please think step by step and answer the question". Right: Response given by the proposed method, which is also the only correct answer.
  • Figure 2: Left: the pipeline of least-to-most synthesis. Right: the process of least-to-most visual reasoning.
  • Figure 3: The performance of the Reasoner varies with the size of Vireo. The dashed lines indicate the performance of corresponding models using 50k training examples.
  • Figure 4: The distribution of different error types.
  • Figure 5: Case 1. This case uses Grounding to locate the donut and uses OCR and Answer to get the final answer.
  • ...and 3 more figures