On the Faithfulness of Visual Thinking: Measurement and Enhancement

Zujing Liu; Junwen Pan; Qi She; Yuan Gao; Guisong Xia

TL;DR

This work reveals a gap in the faithfulness of visual–text multimodal chain‑of‑thought (MCoT) reasoning, showing that current RL fine‑tuning rewards encourage interleaved visual cues without ensuring their correctness or sufficiency. It introduces a causal‑intervention evaluation framework and two LVLM‑based metrics (reliability and sufficiency) to quantify visual faithfulness, finding that visual information is often unreliable and insufficient. To address this, the authors propose Sufficient‑Component Cause Model (SCCM) learning, which enforces visual components to be a sufficient and minimal cause for the correct answer via dedicated rewards $r_s$ and $r_m$, forming a final objective that is annotation‑free and plug‑and‑play. Empirically, SCCM improves visual faithfulness and accuracy across fine‑grained perception and reasoning benchmarks (e.g., V* Bench, HR‑Bench) and outperforms strong baselines, offering a practical path toward faithful “thinking with images.”

Abstract

Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate even when the model still yields the correct answer, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of that visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when we intervene on its visual and textual thoughts. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that the visual evidence is largely ignored. To further analyze visual information, we introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. To address this issue, we propose a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that are independently capable of leading to correct answers. We note that the proposed SCCM is annotation-free and compatible with various RFT methods for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves visual faithfulness across a suite of fine-grained perception and reasoning benchmarks. Code is available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.
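
The intervention probe described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `Trace` type, the `flip_rate` helper, the toy model, and both corruption functions are hypothetical names introduced here, and the string fields merely stand in for real visual and textual thoughts. The sketch computes the fraction of predictions that change after an intervention; a low rate under visual intervention alongside a high rate under textual intervention is the pattern the abstract reports.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trace:
    """Hypothetical stand-in for one MCoT trace."""
    visual: str  # placeholder for visual thoughts (e.g., a cropped region)
    text: str    # placeholder for textual thoughts


def flip_rate(
    answer_fn: Callable[[Trace], str],
    traces: Sequence[Trace],
    intervene: Callable[[Trace], Trace],
) -> float:
    """Fraction of traces whose predicted answer changes after intervention."""
    if not traces:
        raise ValueError("need at least one trace")
    flips = sum(answer_fn(t) != answer_fn(intervene(t)) for t in traces)
    return flips / len(traces)


def toy_answer(trace: Trace) -> str:
    """Toy model (assumption): answers from the text alone, ignoring visuals."""
    return trace.text.split()[-1]


def corrupt_visual(trace: Trace) -> Trace:
    """Replace the visual thought with noise, keeping the text intact."""
    return replace(trace, visual="<noise>")


def corrupt_text(trace: Trace) -> Trace:
    """Replace the textual thought, keeping the visual thought intact."""
    return replace(trace, text="the answer is unknown")


traces = [
    Trace(visual="crop of a mug", text="the mug is red"),
    Trace(visual="crop of a sign", text="the sign says stop"),
]

visual_rate = flip_rate(toy_answer, traces, corrupt_visual)
text_rate = flip_rate(toy_answer, traces, corrupt_text)
print(f"visual flip rate: {visual_rate:.2f}, text flip rate: {text_rate:.2f}")
```

Because the toy model reads only the text, corrupting the visual thought never changes its answer while corrupting the text always does. In a real evaluation, `answer_fn` would wrap an LVLM call and the interventions would operate on actual image regions and reasoning text.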
