
SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

Weijiang Lv, Yaoxuan Feng, Xiaobo Xia, Jiayu Wang, Yan Jing, Wenchao Chen, Bo Chen

TL;DR

This work targets faithfulness in multimodal chain-of-thought by introducing SPD-Faith Bench, a diagnostic benchmark that uses fine-grained image-difference reasoning to decouple visual evidence from linguistic priors. The authors identify two pervasive failure modes, perceptual blindness and perception-reasoning dissociation, and diagnose their mechanistic origins in attention decay and residual-stream dynamics. They propose SAGE, a train-free See-Analyze-Generate engine, which dynamically routes visual information, rectifies internal information flow, and grounds generation in visual signals through contrastive decoding. Across 12 MLLMs and multiple benchmarks, SPD-Faith Bench reveals a persistent gap between perception and faithful reasoning, while SAGE yields consistent improvements in faithfulness metrics (CR, DRF) and perception metrics, underscoring the importance of internal reasoning dynamics in evaluating model faithfulness. The work provides a practical pathway toward more trustworthy multimodal reasoning by explicitly evaluating and aligning internal processes with visual evidence.

Abstract

Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image-difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes: perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a train-free visual evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and code are available at https://github.com/Johanson-colab/SPD-Faith-Bench.
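
The TL;DR notes that SAGE grounds generation in visual signals through contrastive decoding, but this summary does not give the formulation. As a rough illustration of the general idea only, the sketch below uses a commonly seen generic form of visual contrastive decoding: next-token logits computed with the image are amplified against logits computed without it, so tokens favored mainly by the language prior are down-weighted. The function names, the weight `alpha`, and the toy logits are all assumptions made for illustration; this is not the authors' implementation and may differ from what SAGE actually does.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(with_image, without_image, alpha=1.0):
    """Generic contrastive adjustment (assumed form, not taken from the paper).

    adjusted = (1 + alpha) * with_image - alpha * without_image
    Tokens whose score comes mostly from the text-only prior are penalized.
    """
    if len(with_image) != len(without_image):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * a - alpha * b
            for a, b in zip(with_image, without_image)]


def greedy_pick(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical two-token vocabulary: index 0 = "same", index 1 = "different".
vocab = ["same", "different"]
with_image = [2.0, 1.8]      # made-up logits when the image pair is visible
without_image = [2.5, 0.5]   # made-up logits from the text prompt alone

adjusted = contrastive_logits(with_image, without_image, alpha=1.0)
print("plain choice:      ", vocab[greedy_pick(with_image)])
print("contrastive choice:", vocab[greedy_pick(adjusted)])
print("contrastive probs: ", softmax(adjusted))
```

In this toy setup the text-only prior strongly prefers "same", so subtracting it shifts the decision toward the token that the visual input supports. Real systems typically add safeguards, such as restricting the contrast to tokens that are already plausible, which this sketch omits.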

Paper Structure (88 sections, 23 equations, 42 figures, 14 tables)


Figures (42)

  • Figure 1: An illustrative example demonstrating that an MLLM may produce mutually inconsistent binary judgments for an identical image pair, alternately predicting "same" and "different".
  • Figure 2: Construction pipeline of SPD-Faith Bench. The pipeline includes two key phases: data collection and data generation. The benchmark contains paired images with either a single difference or multiple differences (2–5), covering three modification types: color, object removal, and position change. Examples are grouped into easy, medium, and hard splits based on instance-level complexity, enabling fine-grained evaluation of visual comparison and multimodal reasoning.
  • Figure 3: Our evaluation framework offers a comprehensive characterization of multimodal reasoning. It measures global perception (DS, DQR), fine-grained detail sensitivity (TF1, CF1), and response faithfulness (CR, DRF).
  • Figure 4: Failure cases of traditional metrics in fine-grained visual reasoning. The model produces fluent language while generating factually incorrect descriptions of the visual differences.
  • Figure 5: Comprehensive evaluation of MLLMs across three dimensions. Models are evaluated in global perception (DS, DQR), faithful perception (TF1, CF1), and faithful reasoning (CR, DRF).
  • ...and 37 more figures