Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models

Jialin Wu; Wei Shi; Han Shen; Peigui Qi; Kunsheng Tang; Zhicong Huang; Binghao Wang; Zhou Yang

TL;DR

Revis addresses object hallucination in vision-language models by decoupling visual information from language priors through an orthogonal projection that yields a purified visual vector. It then performs sparse, calibration-guided interventions at a selectively chosen network depth, gated by a dynamic risk threshold to maintain stability and efficiency. Empirical results across multiple architectures and benchmarks show a consistent reduction in hallucinations (up to around 19% relative) while preserving or enhancing general reasoning performance, with minimal inference latency. This approach offers a practical, training-free solution that generalizes across diverse LVLM backbones and tasks, enabling safer grounded generation in real-time settings.
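The orthogonal projection mentioned above can be illustrated with a toy example. This is a minimal sketch of the general technique (subtracting a vector's component along a reference direction), not the paper's implementation. The names `remove_component`, `raw_visual`, and `language_prior` are hypothetical, and this page does not specify how the paper obtains either vector.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_component(v: Vector, prior: Vector, eps: float = 1e-12) -> Vector:
    """Return v minus its projection onto `prior`.

    The result is orthogonal to `prior`. If `prior` is (near) zero there is
    no direction to remove, so v is returned unchanged.
    """
    denom = dot(prior, prior)
    if denom < eps:
        return list(v)
    scale = dot(v, prior) / denom
    return [x - scale * p for x, p in zip(v, prior)]


# Toy 3-dimensional vectors (assumed values, for illustration only).
raw_visual = [3.0, 4.0, 0.0]
language_prior = [1.0, 0.0, 0.0]

pure_visual = remove_component(raw_visual, language_prior)
print(pure_visual)                       # component along the prior is gone
print(dot(pure_visual, language_prior))  # zero: the two are orthogonal
```

In this toy case the component of `raw_visual` along `language_prior` is removed and the remainder is untouched, which is the sense in which the resulting vector is "purified" of the prior direction.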

Abstract

Despite their advanced capabilities, Large Vision-Language Models (LVLMs) frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers, which suppresses the visual information. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.
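The idea of sparse intervention at a single depth, combined with the risk gate described in the TL;DR, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' algorithm: the function `steer_hidden_state`, the scalar `risk` score, the fixed `risk_threshold`, and the layer index are all hypothetical, and this page does not say how the paper computes the risk or selects the layer. The steering strength `alpha` follows the symbol used in the Figure 3 caption.

```python
from typing import List

Vector = List[float]


def steer_hidden_state(
    hidden: Vector,
    direction: Vector,
    layer: int,
    target_layer: int,
    risk: float,
    risk_threshold: float,
    alpha: float,
) -> Vector:
    """Add alpha * direction to `hidden`, but only at one layer and only
    when the (assumed) risk score exceeds the threshold.

    Every other layer, and every low-risk step, is passed through unchanged,
    which is what makes the intervention sparse.
    """
    if layer != target_layer or risk <= risk_threshold:
        return list(hidden)
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy values (assumptions for illustration only).
hidden = [1.0, 2.0]
direction = [0.0, 1.0]   # stands in for a purified visual vector
TARGET_LAYER = 3          # hypothetical calibrated depth

steered = steer_hidden_state(hidden, direction, 3, TARGET_LAYER, 0.9, 0.5, 0.5)
skipped_low_risk = steer_hidden_state(hidden, direction, 3, TARGET_LAYER, 0.1, 0.5, 0.5)
skipped_other_layer = steer_hidden_state(hidden, direction, 2, TARGET_LAYER, 0.9, 0.5, 0.5)
print(steered, skipped_low_risk, skipped_other_layer)
```

Gating on both layer and risk keeps the added work to one vector addition on the steps that are steered and nothing on the rest, which is consistent with the low latency overhead the paper reports. The real gate is described as dynamic, so a fixed threshold is a simplification here.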

Paper Structure (40 sections, 9 equations, 9 figures, 13 tables, 2 algorithms)

Introduction
Related Work
Hallucination Mitigation in LVLMs
Mechanistic Interpretability
Preliminary Analysis
Feasibility Validation
Analysis of Feature Entanglement
Motivation for Sparse Intervention
Revis: Design Details
Orthogonal Visual Vector Construction
Layer Selection via Calibration
Dynamic Steering Mechanism
Experiments
Setup
Overall Performance
...and 25 more sections

Figures (9)

Figure 1: Overview of Revis. After applying orthogonal projection to isolate the pure visual vector from language priors, Revis steers the LVLM to generate faithful and grounded descriptions, effectively correcting the initial hallucinations (e.g., "chocolate", "strawberries").
Figure 2: t-SNE visualization of [EOS] token hidden states at Layer 27 of Qwen2.5-VL. The [EOS] token captures global semantics, verifying that state separation is driven by content factuality rather than lexical variations.
Figure 3: Sensitivity analysis of $\alpha$. Naive steering leads to model collapse (sharp metric drop) at high intensities.
Figure 4: Design of Revis. Revis utilizes orthogonal projection to extract purified visual vectors, and performs sparse intervention through calibration-based layer selection and inference-time dynamic risk-aware steering.
Figure 5: Inference Latency Comparison. We report the mean Time Per Token (TPT) in seconds. Revis maintains the efficiency of Regular decoding.
...and 4 more figures
