Boosting Multimodal Reasoning with Automated Structured Thinking

Jinyang Wu, Mingkuan Feng, Shuai Zhang, Fangrui Lv, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao

TL;DR

This work tackles the efficiency-performance gap in multimodal reasoning by introducing AStar, a training-free framework that uses a compact library of thought cards distilled from 500 samples via Monte Carlo Tree Search. By adaptively selecting five high-level reasoning templates based on problem attributes and integrating external guidance with the model's internal reasoning, AStar achieves competitive or superior results on MathVerse and MathVision with a 7B backbone and without large-scale training. The approach demonstrates strong generalization to visual perception tasks and enables plug-and-play integration with post-training methods like GRPO, offering a scalable pathway to more capable and accessible multimodal reasoning systems. Ablation and OOD experiments further underscore the robustness, efficiency, and broad applicability of the thought-card paradigm.

Abstract

Multimodal large language models excel across diverse domains but struggle with complex visual reasoning tasks. Current approaches aim to incorporate structured thinking via two strategies: explicit search methods and post-training techniques. However, both approaches face significant limitations: Search-based methods suffer from computational inefficiency due to extensive solution space exploration, while post-training methods require substantial data, computational resources, and often encounter training instability. To address these limitations, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning. Our method introduces "thought cards", a lightweight library of high-level reasoning patterns abstracted from 500 prior samples using Monte Carlo Tree Search. For each test problem, AStar adaptively retrieves the optimal thought cards and seamlessly integrates these external explicit guidelines with the model's internal implicit reasoning capabilities. Extensive experiments demonstrate AStar's effectiveness and efficiency: using only 500 prior samples and a 7B backbone, our training-free framework achieves 53.9% accuracy on MathVerse (surpassing GPT-4o's 50.2%) and 32.7% on MathVision (versus GPT-4o's 30.4%). Further analysis reveals that AStar generalizes beyond multimodal reasoning to visual perception and understanding domains, and serves as a plug-and-play test-time inference method compatible with mainstream post-training techniques like GRPO.
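
The retrieval step described above (choosing a few thought cards per test problem from a small prebuilt library) can be illustrated with a minimal sketch. This is not the authors' implementation: the `ThoughtCard` fields, the single numeric `complexity` attribute, and the nearest-match ranking are all assumptions made for illustration, and the abstract does not specify which problem attributes or matching rule AStar actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThoughtCard:
    """Hypothetical container for one high-level reasoning template."""
    name: str
    actions: tuple[str, ...]  # assumed: an ordered sequence of reasoning actions
    complexity: float         # assumed: a scalar attribute summarizing problem difficulty


def select_cards(cards, problem_complexity, k=5):
    """Return the k cards whose (assumed) complexity is closest to the problem's."""
    ranked = sorted(cards, key=lambda c: abs(c.complexity - problem_complexity))
    return ranked[:k]


# Toy library; a real one would be distilled offline from prior samples.
library = [
    ThoughtCard("direct", ("answer",), 1.0),
    ThoughtCard("decompose", ("split", "solve", "merge"), 3.0),
    ThoughtCard("caption_then_reason", ("describe_image", "reason"), 2.0),
]

chosen = select_cards(library, problem_complexity=2.2, k=2)
print([c.name for c in chosen])
```

The point of the sketch is only that selection is a cheap lookup over a small library at test time, which is where the claimed efficiency advantage over per-problem tree search would come from.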

Paper Structure

This paper contains 34 sections, 17 equations, 13 figures, 13 tables, 1 algorithm.

Figures (13)

  • Figure 1: Evaluation results on MathVerse. AStar makes 7B models competent problem-solvers, surpassing GPT-4o. Our approach demonstrates consistent effectiveness across multiple model architectures and scales.
  • Figure 2: Flowchart of our method AStar. This framework consists of three parts: (1) Visual Reasoning Action Definition; (2) Thought Card Construction; (3) Adaptive Reasoning and Verification.
  • Figure 3: Results on the challenging reasoning benchmark, MathVision. AStar-7B (Qwen2.5-7B) achieves performance competitive with GPT-4o.
  • Figure 4: Comparison between AStar and powerful MLLMs across 3 challenging benchmarks: MathVista, MathVerse, and MathVision. 'OS' and 'CS' denote open-source and closed-source models. AStar with 7B backbone outperforms them.
  • Figure 5: Ablation results on AStar-7B (Qwen2.5-7B). 'RAC', 'RC', 'RS', and 'SC' denote 'random action combinations', 'random card', 'random selection', and 'self-consistency', respectively. We observe that every component is important for optimal performance.
  • ...and 8 more figures