History-Guided Iterative Visual Reasoning with Self-Correction

Xinglong Yang, Zhilin Peng, Zhanzhan Liu, Haochen Shi, Sheng-Jun Huang

TL;DR

The paper tackles reliability in multimodal reasoning by introducing H-GIVR, which combines history-guided iterative reasoning with self-correction. It uses a Visual Description module, an Image Re-observation mechanism, and a Consistency-Iterative Reasoning process, plus an Answer Confirmation mechanism, to dynamically refine answers. The approach yields significant accuracy gains across five VQA datasets and three models while maintaining low computational cost, demonstrating practical benefits for cross-modal reasoning without heavy prompt engineering. This work advances inference-time self-correction for multimodal LLMs, supporting more robust, scalable cross-modal reasoning in real-world applications.

Abstract

Self-consistency methods are a core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed "repeated sampling and voting" paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and to adjust their reasoning dynamically during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using Llama3.2-vision:11b on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90%, representing a 107% improvement over the baseline.
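
The contrast the abstract draws between fixed "repeated sampling and voting" and history-guided iteration can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `AnswerFn` interface, the function names, and in particular the stopping rule (stop once two consecutive answers agree), which is a stand-in and not the paper's Answer Confirmation mechanism. The model is a plain callable, so no real MLLM API is implied.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical model interface (an assumption for this sketch):
# (image, question, earlier answers) -> answer string.
AnswerFn = Callable[[object, str, List[str]], str]


def self_consistency(model: AnswerFn, image: object, question: str,
                     n_samples: int) -> str:
    """Fixed 'repeated sampling and voting': each call sees no history."""
    answers = [model(image, question, []) for _ in range(n_samples)]
    # Majority vote over the independent samples.
    return Counter(answers).most_common(1)[0][0]


def history_guided(model: AnswerFn, image: object, question: str,
                   max_rounds: int = 5) -> Tuple[str, int]:
    """Iterate, passing earlier answers back to the model each round.

    Assumed stopping rule: accept an answer as soon as it matches the
    previous round's answer. Returns (answer, number of model calls).
    """
    history: List[str] = []
    for _ in range(max_rounds):
        # The image is passed again on every round (re-observation).
        answer = model(image, question, history)
        if history and answer == history[-1]:
            return answer, len(history) + 1
        history.append(answer)
    # No agreement within the budget: fall back to a vote over all answers.
    return Counter(history).most_common(1)[0][0], len(history)
```

In this sketch the number of model calls varies per question, since the loop exits as soon as two consecutive answers agree, which is one way an average well below a fixed sampling budget (such as the 2.57 responses per question reported above) could arise.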

Paper Structure

This paper contains 24 sections, 2 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: The detailed process by which the H-GIVR framework answers a VQA question.
  • Figure 2: Illustration of the H-GIVR framework. Visual Description and Consistency-Iterative Reasoning serve as the two core components, which aim to deepen the model’s understanding of images and to simulate human review behavior during the answering process; the Image Re-observation Mechanism and the Answer Confirmation Mechanism act as two key decision mechanisms, which are designed to reduce the model’s misunderstanding of visual information and to determine the final answer.
  • Figure 3: The average number of model calls required by H-GIVR to process each question across three models, five datasets, and three application settings. The bar chart shows the average number of model calls, while the line chart illustrates how the average number of calls varies across the five datasets.
  • Figure 4: Performance differences between H-GIVR and other baseline methods across fine-grained domains.