The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

Ruobing Zheng, Tianqi Li, Jianing Li, Qingpei Guo, Yi Yuan, Jingdong Chen

TL;DR

We propose Dual Tuning, a framework for assessing whether reasoning yields positive gains on a target task under a given base model and dataset, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains.

Abstract

While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.
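
The comparison of the two training modes can be illustrated with a minimal sketch. The abstract does not give the metric definitions, so everything here is an assumption: a gain is taken to be accuracy after tuning minus the base model's accuracy in the same inference mode, and the three labels are hypothetical stand-ins, not the paper's regions. The names `TaskScores`, `gains`, and `label` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskScores:
    """Accuracy (in percent) of one task under each inference mode."""
    base_cot: float   # base model, Chain-of-Thought inference
    base_da: float    # base model, Direct-Answer inference
    tuned_cot: float  # after tuning, Chain-of-Thought inference
    tuned_da: float   # after tuning, Direct-Answer inference


def gains(s: TaskScores) -> tuple[float, float]:
    """Return (gain_cot, gain_da).

    ASSUMPTION: gain = tuned accuracy - base accuracy in the same mode.
    The paper's actual metric definitions may differ.
    """
    return s.tuned_cot - s.base_cot, s.tuned_da - s.base_da


def label(s: TaskScores) -> str:
    """Assign a hypothetical suitability label from the two gains.

    ASSUMPTION: these three labels are illustrative and are not the
    regions defined in the paper.
    """
    g_cot, g_da = gains(s)
    if g_cot <= 0 and g_da <= 0:
        return "neither"  # neither training mode improved the task
    return "cot" if g_cot > g_da else "da"


if __name__ == "__main__":
    # Made-up numbers, used only to exercise the functions.
    example = TaskScores(base_cot=40.0, base_da=45.0, tuned_cot=55.0, tuned_da=47.0)
    print(gains(example), label(example))
```

Plotting each task's pair of gains on two axes would give a map of the kind the paper describes, with the caveat that the region boundaries used here are placeholders.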

Paper Structure (15 sections, 5 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: The base model shows discrepancies in initial performance between CoT and DA inference across various tasks. Positive values indicate that CoT inference has an advantage.
  • Figure 2: We plot each task's $\mathbf{Gain_{CoT}}$ and $\mathbf{Gain_{DA}}$ in a two-dimensional coordinate map. Through three distinct regions, we categorize the suitability of different tasks for the two training modes.
  • Figure 3: We evaluate two different datasets on MMMU, marked by circles (original) and triangles (new). The resulting shift in task distribution highlights how thinking patterns dictate reasoning suitability across tasks.
  • Figure 4: The effectiveness of a thinking pattern depends on its refinement and the exclusion of redundant or invalid reasoning. We compare the $\mathbf{Gain_{token}}$ for both datasets on MathVista tasks.
  • Figure 5: We partition tasks into two halves by their $\mathbf{Gain_{DA}}$ in Figure \ref{fig:mmmu_sca} and run a separate DA training on the data belonging to each half. After standalone training, tasks in the left half predominantly show negative gains, while tasks in the right half mostly achieve positive gains, which confirms the efficacy of the corresponding data.
  • ...and 2 more figures