Boosting Multimodal Reasoning with Automated Structured Thinking
Jinyang Wu, Mingkuan Feng, Shuai Zhang, Fangrui Lv, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao
TL;DR
This work tackles the efficiency-performance gap in multimodal reasoning by introducing AStar, a training-free framework that uses a compact library of thought cards distilled from 500 samples via Monte Carlo Tree Search. By adaptively selecting five high-level reasoning templates based on problem attributes and integrating external guidance with the model's internal reasoning, AStar achieves competitive or superior results on MathVerse and MathVision with a 7B backbone and without large-scale training. The approach demonstrates strong generalization to visual perception tasks and enables plug-and-play integration with post-training methods like GRPO, offering a scalable pathway to more capable and accessible multimodal reasoning systems. Ablation and OOD experiments further underscore the robustness, efficiency, and broad applicability of the thought-card paradigm.
Abstract
Multimodal large language models excel across diverse domains but struggle with complex visual reasoning tasks. Current approaches aim to incorporate structured thinking via two strategies: explicit search methods and post-training techniques. However, both face significant limitations: search-based methods are computationally inefficient because they explore the solution space extensively, while post-training methods require substantial data and computational resources and often encounter training instability. To address these limitations, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning. Our method introduces "thought cards", a lightweight library of high-level reasoning patterns abstracted from 500 prior samples using Monte Carlo Tree Search. For each test problem, AStar adaptively retrieves the optimal thought cards and integrates these external explicit guidelines with the model's internal implicit reasoning capabilities. Extensive experiments demonstrate AStar's effectiveness and efficiency: using only 500 prior samples and a 7B backbone, our training-free framework achieves 53.9% accuracy on MathVerse (surpassing GPT-4o's 50.2%) and 32.7% on MathVision (versus GPT-4o's 30.4%). Further analysis reveals that AStar generalizes beyond multimodal reasoning to visual perception and understanding domains, and serves as a plug-and-play test-time inference method compatible with mainstream post-training techniques like GRPO.
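As a rough illustration of the retrieval step described above, the following minimal Python sketch selects thought cards by proximity in an attribute space and prepends them to the question as explicit guidance. It is not the paper's implementation: the `ThoughtCard` class, the `select_cards` and `build_prompt` functions, the numeric attribute vectors, the Euclidean distance criterion, and the default `k=5` (taken from the TL;DR's mention of five templates) are all assumptions introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class ThoughtCard:
    """Hypothetical container for one high-level reasoning template."""
    name: str
    steps: tuple[str, ...]         # abstract reasoning actions, not a worked solution
    attributes: tuple[float, ...]  # illustrative attribute vector (an assumption)


def select_cards(problem_attributes, library, k=5):
    """Return the k cards whose attribute vectors are closest to the problem's.

    Euclidean distance is an illustrative choice, not the paper's stated metric.
    """
    ranked = sorted(library, key=lambda card: dist(card.attributes, problem_attributes))
    return ranked[:k]


def build_prompt(question, cards):
    """Combine the selected templates with the question as explicit guidance."""
    lines = ["Candidate reasoning templates:"]
    for card in cards:
        lines.append(f"- {card.name}: " + " -> ".join(card.steps))
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

In this sketch the model would receive the resulting prompt and still produce its own reasoning, so the templates act as external guidance layered on top of the model's internal reasoning rather than replacing it.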
