Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

Boxu Chen, Ziwei Zheng, Le Yang, Zeyu Geng, Zhengyu Zhao, Chenhao Lin, Chao Shen

TL;DR

This work tackles object hallucination (OH) in large vision-language models by introducing VaLSe, a Vision-aware Latent Steering framework that first interprets how visual inputs influence token generation and then mitigates OH through latent-space edits. Its interpretation-then-mitigation pipeline builds visual contribution maps for selected tokens and comprises artifact decontamination, paired-sample construction with vision-guided masking, and a latent steering step. In the steering step, the model’s hidden representations are shifted along the top singular direction of the difference between positive and negative samples: with $E_l = X_l^+ - X_l^- = U_l \Sigma_l V_l^\top$, the resulting edit vector $v_l^{\text{edit}}$ is applied as $x_l \leftarrow x_l + v_l^{\text{edit}}$. The method uses a log-likelihood ratio, $LLR(y_t) = \log P(y_t|y_{<t}, I, T) - \log P(y_t|y_{<t}, \tilde{I}, T)$, to identify visually grounded tokens, and it constructs positive samples by masking low-contribution regions, enabling targeted latent edits. Experiments across CHAIR, AMBER, POPE, MMHal, MMVP, and general benchmarks show that VaLSe reduces object hallucinations while preserving or improving broader multimodal capabilities. They also reveal limitations in existing OH metrics that motivate more nuanced, visually grounded evaluations.
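The two formulas in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `threshold` value, the `alpha` scale, the choice of the top right-singular vector (rows of $X_l^\pm$ taken as samples), and the sign-orientation step are all assumptions made for illustration.

```python
import numpy as np


def log_likelihood_ratio(logp_image, logp_perturbed):
    """Per-token LLR(y_t) = log P(y_t | y_<t, I, T) - log P(y_t | y_<t, I~, T)."""
    return np.asarray(logp_image, dtype=float) - np.asarray(logp_perturbed, dtype=float)


def select_grounded_tokens(llr, threshold=1.0):
    """Indices of tokens whose LLR exceeds a threshold (hypothetical value)."""
    return np.flatnonzero(np.asarray(llr) > threshold)


def steering_direction(x_pos, x_neg):
    """Top singular direction of E = X+ - X- for one layer.

    x_pos, x_neg: arrays of shape (n_samples, hidden_dim).
    Assumption: the edit direction is the first right-singular vector.
    """
    e = np.asarray(x_pos, dtype=float) - np.asarray(x_neg, dtype=float)
    _, _, vt = np.linalg.svd(e, full_matrices=False)
    v = vt[0]
    # The sign of a singular vector is arbitrary; orient it so that it
    # points from negative samples toward positive samples on average.
    if e.mean(axis=0) @ v < 0:
        v = -v
    return v


def steer(x, v_edit, alpha=1.0):
    """Apply x <- x + alpha * v_edit (alpha = 1 matches the summary's update)."""
    return np.asarray(x, dtype=float) + alpha * v_edit


# Toy usage with made-up numbers.
llr = log_likelihood_ratio([-0.5, -2.0], [-3.0, -2.1])
grounded = select_grounded_tokens(llr)
x_pos = np.array([[2.0, 0.0, 0.0], [4.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
x_neg = np.zeros((3, 3))
v_edit = steering_direction(x_pos, x_neg)
x_new = steer(np.zeros(3), v_edit)
```

In this toy setup the positive and negative hidden states differ only along the first coordinate, so the recovered direction is the first basis vector and the steered state moves along it.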

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: https://github.com/Ziwei-Zheng/VaLSe.
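As a rough illustration of how a visual contribution map could drive masking, the sketch below keeps the highest-contribution patches of an image and blanks out the rest. The quantile-based cutoff, the `keep_ratio` and `fill` parameters, and the patch-grid layout are hypothetical choices for this example; the page does not describe the masking rule at this level of detail.

```python
import numpy as np


def mask_low_contribution(image, contribution, keep_ratio=0.5, fill=0.0):
    """Blank out image patches whose contribution falls below a quantile cutoff.

    image:        array of shape (H, W, C)
    contribution: array of shape (h, w), one score per patch; H and W must
                  be divisible by h and w respectively (assumed layout).
    keep_ratio:   fraction of patches to keep (hypothetical parameter).
    """
    image = np.asarray(image, dtype=float)
    contribution = np.asarray(contribution, dtype=float)
    H, W = image.shape[:2]
    h, w = contribution.shape
    if H % h or W % w:
        raise ValueError("image size must be a multiple of the patch grid")

    # Patches at or above the cutoff are treated as vision-aware focus regions.
    cutoff = np.quantile(contribution, 1.0 - keep_ratio)
    keep = contribution >= cutoff

    # Upsample the patch-level mask to pixel resolution.
    pixel_keep = np.repeat(np.repeat(keep, H // h, axis=0), W // w, axis=1)
    return np.where(pixel_keep[..., None], image, fill)


# Toy usage: a 4x4 RGB image of ones with a 2x2 contribution map.
image = np.ones((4, 4, 3))
contribution = np.array([[0.9, 0.1], [0.2, 0.8]])
masked = mask_low_contribution(image, contribution, keep_ratio=0.5)
```

Here the two patches on the main diagonal are kept and the other two are filled with zeros, so the semantically dominant regions survive while low-contribution regions are removed.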

Paper Structure

This paper contains 51 sections, 8 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: The proposed VaLSe can effectively (a) eliminate artifacts and provide high-quality visualization results, and then (b) mitigate OH by vision-aware latent steering. Beyond mitigating OH, VaLSe can further provide in-depth analysis of (c) how a word token is generated based on visual information and (d) why a hallucinated word is generated.
  • Figure 2: VaLSe contains three main modules: (a) a visualization module that generates visual token contribution maps for each selected token; (b) a vision-aware masking module that creates masked images while preserving the main semantic content; (c) a latent steering mechanism.
  • Figure 3: Results on MME.
  • Figure 4: Visualization and analysis results from VaLSe for four different types of hallucination, using LLaVA-1.5 on the CHAIR benchmark.
  • Figure 5: Further analysis with visualization results using LLaVA-1.5.
  • ...and 7 more figures