FaithAct: Faithfulness Planning and Acting in MLLMs

Junxian Li, Xinyue Xu, Sai Ma, Di Zhang, Seth Lazar, Sichao Li

TL;DR

The paper tackles unfaithfulness in multimodal large language models (MLLMs) by distinguishing perceptual faithfulness (PF) from behavioral faithfulness (BF) and introducing FaithEval to quantify step- and chain-level grounding. It then presents FaithAct, a faithfulness-first planning framework that verifies evidential grounding at every reasoning step via a lightweight API (Poll, Ground, Select, Abstain, Count) and uses a refine-based loop to ensure each decision is grounded before reasoning proceeds. Empirical results on RealWorldQA and MMHal show that FaithAct improves perceptual faithfulness by up to 26% without compromising task accuracy and reduces object hallucination, outperforming prompt-based and tool-augmented baselines. The work provides a unified framework for evaluating and enforcing faithfulness in multimodal reasoning, with implications for more trustworthy and interpretable MLLMs.

Abstract

Unfaithfulness remains a persistent challenge for large language models (LLMs), which often produce plausible yet ungrounded reasoning chains that diverge from perceptual evidence or final conclusions. We distinguish between behavioral faithfulness (alignment between reasoning and output) and perceptual faithfulness (alignment between reasoning and input), and introduce FaithEval for quantifying step-level and chain-level faithfulness by evaluating whether each claimed object is visually supported by the image. Building on these insights, we propose FaithAct, a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step. Experiments across multiple reasoning benchmarks demonstrate that FaithAct improves perceptual faithfulness by up to 26% without degrading task accuracy compared to prompt-based and tool-augmented baselines. Our analysis shows that treating faithfulness as a guiding principle not only mitigates hallucination but also leads to more stable reasoning trajectories. This work thereby establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning.
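The step-level and chain-level scores described above can be illustrated with a minimal sketch. It assumes, without confirmation from the paper, that a step's score is the fraction of its claimed objects that a verifier finds in the image, and that the chain score is the mean of the step scores. The function names, the `is_supported` callback, and the toy data are all hypothetical.

```python
from typing import Callable, List, Sequence


def step_faithfulness(claimed_objects: Sequence[str],
                      is_supported: Callable[[str], bool]) -> float:
    """Fraction of a step's claimed objects that the verifier finds in the image."""
    if not claimed_objects:
        # Assumption: a step that claims no objects is treated as vacuously faithful.
        return 1.0
    supported = sum(1 for obj in claimed_objects if is_supported(obj))
    return supported / len(claimed_objects)


def chain_faithfulness(steps: Sequence[Sequence[str]],
                       is_supported: Callable[[str], bool]) -> float:
    """Assumed aggregation: mean of the step-level scores over the chain."""
    if not steps:
        return 1.0
    return sum(step_faithfulness(s, is_supported) for s in steps) / len(steps)


# Toy data: the set stands in for a real visual verifier.
visible = {"bus", "bicycle", "road"}
chain: List[List[str]] = [
    ["bus", "bicycle"],            # both objects are visible
    ["bicycle", "traffic light"],  # "traffic light" is hallucinated
    ["road"],
]

scores = [step_faithfulness(s, visible.__contains__) for s in chain]
chain_score = chain_faithfulness(chain, visible.__contains__)
print(scores, round(chain_score, 3))
```

In this toy chain only the second step is penalized, so a single hallucinated object lowers the chain score without zeroing it out.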

Paper Structure

This paper contains 34 sections, 2 theorems, 25 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Let $R^{\text{ReAct}} = \{ s_t^{\text{ReAct}} \}_{t=1}^{T}$ and $R^{\text{FaithAct}} = \{ s_t^{\text{FaithAct}} \}_{t=1}^{T'}$ denote the reasoning chains generated by ReAct and FaithAct, respectively. Let $F_{\text{step}}(s_t)$ be the perceptual faithfulness of step $s_t$, and define the chain-level faithfulness [equation missing from this extract]. Assume that FaithAct refines each candidate step $s_t^{(k)}$ using verified evidence such that [conditions missing from this extract].
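As an illustration only, the following worked inequality shows why a dominance result of this kind is plausible. It rests on three assumptions that are not confirmed by the text here: chain-level faithfulness (written $F_{\text{chain}}$, a symbol introduced for this sketch) is the mean of $F_{\text{step}}$ over the chain, the two chains have equal length $T = T'$, and refinement never lowers the faithfulness of a step.

```latex
% Assumptions (not taken from the paper): F_chain is the mean of F_step,
% T = T', and each refined step is at least as faithful as its ReAct counterpart.
\begin{align*}
F_{\text{chain}}(R) &= \frac{1}{|R|} \sum_{s_t \in R} F_{\text{step}}(s_t), \\
F_{\text{step}}\bigl(s_t^{\text{FaithAct}}\bigr) &\ge F_{\text{step}}\bigl(s_t^{\text{ReAct}}\bigr)
  \quad \text{for all } t = 1, \dots, T, \\
F_{\text{chain}}\bigl(R^{\text{FaithAct}}\bigr) - F_{\text{chain}}\bigl(R^{\text{ReAct}}\bigr)
  &= \frac{1}{T} \sum_{t=1}^{T}
     \Bigl[ F_{\text{step}}\bigl(s_t^{\text{FaithAct}}\bigr) - F_{\text{step}}\bigl(s_t^{\text{ReAct}}\bigr) \Bigr]
  \ge 0 .
\end{align*}
```

Under the same assumptions, the difference is strictly positive as soon as one step is strictly improved, which matches the flavor of the corollary titled "Strict Improvement Under Unfaithful Steps".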

Figures (4)

  • Figure 1: Examples of perceptual and behavioral unfaithfulness. Left: behaviorally and perceptually unfaithful. The model presents a step-by-step reasoning trace describing alternating arrow directions and increasing dot counts, yet this reasoning does not reflect its actual decision process. The final choice (A) is likely made through pattern association, with the explanation generated post hoc to rationalize it. Right: behaviorally faithful but perceptually unfaithful. The model describes the bicycle as yellow, influenced by the nearby yellow bus, even though the bicycle is gray. This illustrates a plausible-sounding but perceptually unfaithful reasoning step, where linguistic association overrides perceptual grounding.
  • Figure 2: Overview of the proposed FaithAct framework. FaithAct enforces faithfulness-first multimodal reasoning through two sequential phases, namely verification and action. Left: The Quantifying Faithfulness stage (Sec. 4) extracts claimed objects from each reasoning step and verifies their perceptual grounding in the image using callable functions such as Poll() and Ground(). Right: The Faithful-First Planning stage (Sec. 5) refines reasoning by selectively retaining only evidence-supported objects through structured APIs (Select(), Abstain(), Count()). This design ensures faithfulness as a principle, where every reasoning step is evidentially justified before contributing to the final answer.
  • Figure 3: Distribution of the average $F_{\text{step}}$ difference across reasoning steps. The x-axis shows the reasoning step and the y-axis shows the average difference in $F_{\text{step}}$ between Qwen-2.5-VL-7B with and without FaithAct.
  • Figure 4: Qualitative comparison of reasoning chains generated with and without FaithAct on two illustrative cases. In both tasks, FaithAct enforces step-level perceptual verification, correcting hallucinated descriptions (top in red) and producing more structured, visually grounded reasoning (bottom in blue).
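The verify-then-act flow described in the Figure 2 caption can be sketched roughly as follows. The function names mirror the API names in the caption (Poll, Ground, Select, Abstain, Count), but their signatures and behavior here are hypothetical, and the dictionary "image" is a toy stand-in for a real detector. The sketch also simplifies refinement to filtering out unsupported objects, whereas the paper describes a refine-based loop whose details are not given here.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]
# Toy "image": object name -> detected bounding boxes (hypothetical detector output).
Image = Dict[str, List[Box]]


def poll(image: Image, obj: str) -> bool:
    """Hypothetical Poll(): is there any evidence for the object?"""
    return len(image.get(obj, [])) > 0


def ground(image: Image, obj: str) -> List[Box]:
    """Hypothetical Ground(): return the regions that support the object."""
    return list(image.get(obj, []))


def count(image: Image, obj: str) -> int:
    """Hypothetical Count(): number of grounded instances."""
    return len(ground(image, obj))


def select(image: Image, claimed: List[str]) -> List[str]:
    """Hypothetical Select(): keep only evidence-supported objects."""
    return [obj for obj in claimed if poll(image, obj)]


def refine_step(image: Image, claimed: List[str]) -> Optional[List[str]]:
    """Return the grounded objects, or None to signal Abstain()."""
    kept = select(image, claimed)
    return kept if kept else None


def faithful_chain(image: Image, steps: List[List[str]]) -> List[List[str]]:
    """Verify each step before it is allowed to contribute to the chain."""
    chain: List[List[str]] = []
    for claimed in steps:
        kept = refine_step(image, claimed)
        if kept is None:
            continue  # abstain: nothing in this step is supported by the image
        chain.append(kept)
    return chain


image: Image = {"bus": [(0, 0, 50, 30)], "bicycle": [(60, 10, 80, 30)]}
steps = [["bus", "bicycle"], ["traffic light"], ["bicycle", "helmet"]]
result = faithful_chain(image, steps)
print(result, count(image, "bus"))
```

Here the second step is dropped entirely and the third keeps only the supported object, so every surviving step refers solely to objects with visual evidence.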

Theorems & Definitions (4)

  • Lemma 1: Faithfulness Dominance of FaithAct over ReAct
  • proof
  • Corollary 1: Strict Improvement Under Unfaithful Steps
  • proof