History-Guided Iterative Visual Reasoning with Self-Correction
Xinglong Yang, Zhilin Peng, Zhanzhan Liu, Haochen Shi, Sheng-Jun Huang
TL;DR
The paper addresses the reliability of multimodal reasoning by introducing H-GIVR, a framework that combines history-guided iterative reasoning with self-correction. H-GIVR dynamically refines answers using four components: a Visual Description module, an Image Re-observation mechanism, a Consistency-Iterative Reasoning process, and an Answer Confirmation mechanism. The approach yields significant accuracy gains across five VQA datasets and three models while keeping computational cost low, and it does so without heavy prompt engineering. The work advances inference-time self-correction for multimodal LLMs, supporting more robust and scalable cross-modal reasoning in real-world applications.
Abstract
Self-consistency methods are a core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the final answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed ``repeated sampling and voting'' paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and to adjust their reasoning dynamically during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, with \texttt{Llama3.2-vision:11b} on the ScienceQA dataset, H-GIVR requires an average of 2.57 responses per question to achieve an accuracy of 78.90\%, representing a 107\% improvement over the baseline.
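To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It sets a fixed "repeated sampling and voting" baseline next to a history-guided loop that re-observes the image each round and passes earlier answers to the next call. The callables `answer_fn` and `describe_fn`, the stopping rule (stop when an answer matches the previous one), and the majority-vote fallback are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter
from typing import Any, Callable, List, Tuple

# Hypothetical model interface (an assumption, not from the paper):
# (question, image_description, previous_answers) -> answer string.
AnswerFn = Callable[[str, str, List[str]], str]
# Hypothetical image describer: image -> textual description.
DescribeFn = Callable[[Any], str]


def self_consistency_vote(
    answer_fn: AnswerFn, question: str, description: str, n_samples: int = 5
) -> str:
    """Baseline: sample a fixed number of answers with no history, then vote."""
    answers = [answer_fn(question, description, []) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def history_guided_answer(
    answer_fn: AnswerFn,
    describe_fn: DescribeFn,
    question: str,
    image: Any,
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Iterate with answer history; return (answer, number of responses used)."""
    history: List[str] = []
    for _ in range(max_rounds):
        # Re-observe the image on every round instead of reusing one description.
        description = describe_fn(image)
        # Earlier answers are passed along as references for this round.
        answer = answer_fn(question, description, history)
        # Assumed stopping rule: accept once two consecutive answers agree.
        if history and answer == history[-1]:
            return answer, len(history) + 1
        history.append(answer)
    # Assumed fallback: majority vote over everything seen so far.
    return Counter(history).most_common(1)[0][0], len(history)
```

Under this assumed stopping rule, a question whose first two answers agree costs only two responses, whereas the voting baseline always spends `n_samples`. An early exit of this kind is one way an average response count well below a fixed sampling budget could arise.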
