One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination

Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi

TL;DR

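A unified, training-free framework gives the vision token two roles: augmented tokens strengthen visual representations (SVC), and pruned tokens create latent-space negative samples for correcting internal model biases (CRC). Harmonizing these two roles restores the vision-language balance and significantly reduces object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.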

Abstract

Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language priors, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find that their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance and significantly reduces object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.

Paper Structure (22 sections, 12 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Disjoint Paradigms vs. Our Unified Latent Calibration. Prior work includes (a) Visual Attention Enhancement [liu2024paying] and (b) Textual Decoding Refinement at the final logits [leng2024mitigating]. Naively combining these disjoint paradigms, (a)+(b), degrades performance, highlighting conflicting signals. (c) Our Unified Latent Calibration operates entirely at the representation level, using the vision token as a single source for both SVC and CRC, achieving superior and synergistic results.
  • Figure 2: Our Three Core Findings. (F1) Diagnosing the Imbalance: Inverse Correlation. Visual attention decays sharply as generation proceeds, while hallucination frequency surges where visual grounding is weakest. (F2) Enabling Enhancement: Semantic Complementarity. Original and augmented image attentions show complementary focus (e.g., on 'Camera'); their synergy enables enhanced visual grounding. (F3) Enabling Calibration: Superiority of Information-Gap. Latent-space token removal (information-gap) generates stable, grounded hallucinations, proving more suitable for bias probing than unstable, noisy pixel-level masking (modality-gap).
  • Figure 3: Overview of our unified framework. The model processes an original input stream (orange path) and a parallel hallucination-probe stream (purple path) derived from pruned vision tokens. Our Synergistic Visual Calibration (SVC) module injects complementary visual context from augmented images into a critical middle layer ($L_c$) to counteract visual fading. Simultaneously, the Causal Representation Calibration (CRC) module uses the differential representations between the two streams to purify hidden states in shallow layers ($1 \dots L_c$), suppressing linguistic priors.
  • Figure 4: Illustration of the Causal Representation Calibration (CRC) mechanism. By subtracting the hallucinated representation ($\mathbf{H}_{\text{neg}}$) from the original ($\mathbf{H}_{\text{org}}$), we obtain a differential vector ($\Delta \mathbf{H}$). Averaging these vectors across multiple negative samples yields a stable hallucination direction ($\mathbf{v}_{\text{crc}}$). The final calibration modifies the original representation away from this direction to produce a purified output ($\mathbf{H}_{\text{pos}}$).
  • Figure 5: The simplified Structural Causal Model (SCM) for MLLM hallucination. We posit that hallucination arises from a spurious causal path from the model's intrinsic bias ($B$) to its latent representation ($H^{(l)}$), which confounds the true visual path from the image ($V$).
  • ...and 5 more figures
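
The CRC mechanism in the Figure 4 caption can be read as three steps: form a differential vector per negative sample, average the differentials into a single direction, and shift the original representation along it. The NumPy sketch below is one possible reading, not the authors' implementation. The function name `crc_calibrate`, the strength parameter `alpha`, the use of a single hidden vector per stream, and the sign of the final update (pushing $\mathbf{H}_{\text{org}}$ further from the pruned-token representations) are all assumptions; the caption does not specify scaling, normalization, or per-layer details.

```python
import numpy as np

def crc_calibrate(h_org, h_negs, alpha=0.5):
    """Illustrative sketch of the CRC steps from the Figure 4 caption.

    h_org:  shape (d,), hidden state from the original stream.
    h_negs: shape (k, d), hidden states from k pruned-token negative samples.
    alpha:  hypothetical calibration strength (not given in the source).
    """
    h_org = np.asarray(h_org, dtype=float)
    h_negs = np.asarray(h_negs, dtype=float)

    # Step 1: differential vectors, Delta H = H_org - H_neg, one per sample.
    deltas = h_org[None, :] - h_negs

    # Step 2: average over negative samples to get a stable direction v_crc.
    v_crc = deltas.mean(axis=0)

    # Step 3 (assumed sign): move H_org further away from the negative samples.
    h_pos = h_org + alpha * v_crc
    return h_pos, v_crc

# Toy example with d = 3 and k = 2 negative samples.
h_org = np.array([1.0, 2.0, 3.0])
h_negs = np.array([[0.0, 2.0, 4.0],
                   [2.0, 2.0, 4.0]])
h_pos, v_crc = crc_calibrate(h_org, h_negs, alpha=0.5)
```

In the toy example the two negative samples disagree on the first coordinate, so that component cancels in the average, while the shift they share on the third coordinate survives. This is the sense in which averaging over several negatives gives a more stable direction than any single differential.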