ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Jingqi Zhou, Sheng Wang, Jingwei Dong, Kai Liu, Lei Li, Jiahui Gao, Jiyue Jiang, Lingpeng Kong, Chuan Wu

TL;DR

This work addresses the challenge of LVLMs failing to integrate visual information with textual reasoning by proposing ProReason, a decoupled multi-modal reasoning framework that separates proactive visual perception from textual reasoning. By employing specialized sub-agents (Dispatcher, Vision Expert, Insight Expert, Referee, Summarizer) and a Memory component, ProReason enables iterative, question-driven information gathering and robust final reasoning, while seamlessly leveraging existing LLMs for enhanced performance. Across four diverse visual reasoning benchmarks with open- and closed-source models, ProReason achieves substantial improvements (average around 13.2%), outperforms simultaneous-use baselines, and enables high-quality data generation that improves downstream models (e.g., ProReason-VL and ProReason-Q3). The results demonstrate the feasibility and value of LLM-assisted reasoning in LVLMs, highlight the importance of decoupling perception from reasoning, and suggest promising directions for future visual reasoning research and downstream task enhancement.

Abstract

Large vision-language models (LVLMs) have made significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., limited multi-modal reasoning capacities, and insufficient and irrelevant visual descriptions). We then decompose the visual reasoning process into two stages: proactive visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features decoupled vision-reasoning capabilities and multi-run proactive perception. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the decoupling of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks for both open-source and closed-source models, with the average performance gain reaching 13.2%. In addition, the integration of LLMs allows ProReason to produce high-quality visual reasoning data, which empowers ProReason-distilled models (i.e., ProReason-VL and ProReason-Q3) to achieve superior performance in downstream tasks. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.
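
The iterate-until-sufficient loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `proreason_loop`, the `max_rounds` cap, the action labels, and the exact division of labor among the agents are assumptions made for illustration, and each agent is a plain callable standing in for an LVLM or LLM call. Only the agent names (Dispatcher, Vision Expert, Insight Expert, Referee, Summarizer) and the Memory component come from the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Shared store of the information gathered so far."""
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)


def proreason_loop(
    question: str,
    image: object,
    dispatcher: Callable[[str, List[str]], Tuple[str, str]],
    vision_expert: Callable[[object, str], str],
    insight_expert: Callable[[str, List[str]], str],
    referee: Callable[[str, List[str]], bool],
    summarizer: Callable[[str, List[str]], str],
    max_rounds: int = 5,  # hypothetical safety cap, not from the paper
) -> str:
    """Gather information until the referee deems it sufficient, then answer."""
    memory = Memory()
    for _ in range(max_rounds):
        # Assumed role: the dispatcher picks the next expert and a sub-question.
        action, query = dispatcher(question, memory.notes)
        if action == "vision":
            # Perception ("eyesight"): only this step looks at the image.
            memory.add(vision_expert(image, query))
        else:
            # Textual reasoning ("wisdom"): works on text in memory only.
            memory.add(insight_expert(query, memory.notes))
        # Assumed role: the referee decides whether memory suffices to answer.
        if referee(question, memory.notes):
            break
    return summarizer(question, memory.notes)


# Toy stand-ins for the model calls, so the sketch runs without any model.
def toy_dispatcher(question: str, notes: List[str]) -> Tuple[str, str]:
    if not notes:
        return ("vision", "How many apples are on the table?")
    return ("insight", "How many remain after removing two?")


answer = proreason_loop(
    question="If two apples are removed, how many remain?",
    image=None,
    dispatcher=toy_dispatcher,
    vision_expert=lambda image, query: "There are 5 apples on the table.",
    insight_expert=lambda query, notes: "5 - 2 = 3 apples remain.",
    referee=lambda question, notes: len(notes) >= 2,
    summarizer=lambda question, notes: notes[-1],
)
print(answer)
```

Because the reasoning-side agents in this sketch only ever see text from memory, any of them could be backed by a text-only LLM, which is the kind of integration the abstract attributes to decoupling.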

Paper Structure

This paper contains 65 sections, 16 figures, 15 tables.

Figures (16)

  • Figure 1: Overview and comparison of ProReason, VDGD and ReAct. Unlike existing works (e.g., VDGD and ReAct), our proposed method decouples visual perception and textual reasoning while allowing the model to actively acquire necessary information from the images, achieving superior performance.
  • Figure 2: An example with three reasoning frameworks: fine-grained caption, chain-of-thought, and ProReason. ProReason enables LVLMs to proactively acquire necessary information in a question-oriented manner, and predicts answers based on the collected information. ProReason is superior to previous methods, which often describe question-irrelevant visual details or overlook informative elements. Green indicates correct information or conclusions, while red signifies incorrect ones.
  • Figure 3: A complete reasoning process of ProReason for the case shown in Figure 2.
  • Figure 4: Additional examples of image-unrelated Chain-of-Thought reasoning.
  • Figure 6: A typical mistake made by ProReason. The vision expert incorrectly identifies 4:30 as 6:25, leading the other agents to base their judgments on this erroneous information, and ultimately resulting in the wrong conclusion.
  • ...and 11 more figures