On the Faithfulness of Visual Thinking: Measurement and Enhancement

Zujing Liu, Junwen Pan, Qi She, Yuan Gao, Guisong Xia

TL;DR

This work reveals a gap in the faithfulness of visual–text multimodal chain‑of‑thought (MCoT) reasoning, showing that current RL fine‑tuning rewards encourage interleaved visual cues without ensuring their correctness or sufficiency. It introduces a causal‑intervention evaluation framework and two LVLM‑based metrics (reliability and sufficiency) to quantify visual faithfulness, finding that visual information is often unreliable and insufficient. To address this, the authors propose Sufficient‑Component Cause Model (SCCM) learning, which enforces visual components to be a sufficient and minimal cause for the correct answer via dedicated rewards $r_s$ and $r_m$, forming a final objective that is annotation‑free and plug‑and‑play. Empirically, SCCM improves visual faithfulness and accuracy across fine‑grained perception and reasoning benchmarks (e.g., V* Bench, HR‑Bench) and outperforms strong baselines, offering a practical path toward faithful “thinking with images.”

Abstract

Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate even though the traces still yield correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of that visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened upon. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that the visual evidence is largely ignored. To further analyze visual information, we introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. To address this issue, we propose a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that are independently capable of leading to correct answers. We note that the proposed SCCM is annotation-free and compatible with various RFT methods for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves visual faithfulness across a suite of fine-grained perception and reasoning benchmarks. Code is available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.
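
The intervention probe described in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code: the `Trace` layout, the `predict` callable, the `flip_rate` helper, and the `"<corrupted>"` placeholder are all hypothetical names chosen for illustration, and the toy model stands in for an LVLM. The idea is only to show how a prediction-change rate under visual versus textual intervention could be computed.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout (an assumption, not from the paper):
# a trace is (question, visual_thought, textual_thought).
Trace = Tuple[str, str, str]
Predict = Callable[[str, str, str], str]


def flip_rate(predict: Predict, traces: Sequence[Trace], intervene: str,
              replacement: str = "<corrupted>") -> float:
    """Fraction of traces whose prediction changes after one modality is replaced."""
    if intervene not in ("visual", "textual"):
        raise ValueError("intervene must be 'visual' or 'textual'")
    if not traces:
        return 0.0
    flips = 0
    for question, visual, textual in traces:
        original = predict(question, visual, textual)
        if intervene == "visual":
            perturbed = predict(question, replacement, textual)
        else:
            perturbed = predict(question, visual, replacement)
        flips += original != perturbed
    return flips / len(traces)


def text_only_model(question: str, visual: str, textual: str) -> str:
    """Toy stand-in for an LVLM that ignores the visual thought entirely."""
    return textual.split()[-1]


traces = [("q1", "crop-1", "the answer is A"), ("q2", "crop-2", "the answer is B")]
visual_flip = flip_rate(text_only_model, traces, "visual")
textual_flip = flip_rate(text_only_model, traces, "textual")
```

A model whose answers depend only on the textual thoughts shows a low flip rate under visual intervention and a high one under textual intervention, which is the qualitative pattern the abstract reports.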

Paper Structure

This paper contains 26 sections, 9 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: The mistakes present in the MCoT generated by current works (zheng2025deepeyes; su2025pixel) on the V* Bench dataset can be divided into three categories: 1) irrelevant visual information; 2) incomplete and insufficient visual information; 3) incorrect perception.
  • Figure 2:
  • Figure 3:
  • Figure 5: The overview of our proposed Sufficient-Component Cause Model (SCCM) learning to establish visual information as sufficient-component causes to correct answers. The SCCM framework requires that: 1) the visual information alone is sufficient to lead to the correct answer, enforced by the Visual Information Sufficiency reward $r_s$; and 2) the visual information involved is as minimal as possible, guided by the Visual Information Minimality reward $r_m$.
  • Figure 6:
  • ...and 10 more figures
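
The two requirements in the Figure 5 caption, sufficiency ($r_s$) and minimality ($r_m$), can be sketched as reward terms. The exact definitions and the final objective are not given in this listing, so everything below is an assumption made for illustration: the exact-match check, the area-fraction form of minimality, the weights `w_s` and `w_m`, and the choice to count minimality only when the visual information is sufficient.

```python
def sufficiency_reward(answer_from_visual_only: str, gold: str) -> float:
    """r_s (assumed form): 1 if the answer derived from the visual information alone is correct."""
    same = answer_from_visual_only.strip().lower() == gold.strip().lower()
    return 1.0 if same else 0.0


def minimality_reward(selected_area: float, image_area: float) -> float:
    """r_m (assumed form): larger when the selected visual region is a smaller part of the image."""
    if image_area <= 0:
        raise ValueError("image_area must be positive")
    fraction = min(max(selected_area / image_area, 0.0), 1.0)
    return 1.0 - fraction


def combined_reward(base: float, r_s: float, r_m: float,
                    w_s: float = 1.0, w_m: float = 0.5) -> float:
    """Hypothetical combination: minimality only counts when sufficiency holds."""
    return base + w_s * r_s + w_m * r_s * r_m


r_s = sufficiency_reward(" Red ", "red")
r_m = minimality_reward(25.0, 100.0)
total = combined_reward(1.0, r_s, r_m)
```

Gating $r_m$ on $r_s$ in this sketch is one way to keep a model from shrinking the visual evidence until it no longer supports the answer; the paper's actual objective may combine the terms differently.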

Theorems & Definitions (3)

  • Definition 1
  • Remark 1
  • Remark 2