Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations
Boxu Chen, Ziwei Zheng, Le Yang, Zeyu Geng, Zhengyu Zhao, Chenhao Lin, Chao Shen
TL;DR
This work tackles object hallucination (OH) in large vision-language models by introducing VaLSe, a Vision-aware Latent Steering framework that first interprets how visual inputs influence token generation and then mitigates OH through latent-space edits. VaLSe follows an interpretation-then-mitigation pipeline: it builds visual contribution maps for selected tokens, decontaminates activation artifacts, constructs paired samples with vision-guided masking, and finally steers the model’s hidden representations. To select tokens, the method uses a log-likelihood ratio, $LLR(y_t) = \log P(y_t|y_{<t}, I, T) - \log P(y_t|y_{<t}, \tilde{I}, T)$, which identifies visually grounded tokens; positive samples are then constructed by masking low-contribution regions, enabling targeted latent edits. For steering, the difference between positive and negative samples is decomposed as $E_l = X_l^+ - X_l^- = U_l \Sigma_l V_l^\top$, and its top singular direction yields the edit vector $v_l^{\text{edit}}$, applied as $x_l \leftarrow x_l + v_l^{\text{edit}}$. Experiments across CHAIR, AMBER, POPE, MMHal, MMVP, and general benchmarks demonstrate that VaLSe reduces object hallucinations while preserving or improving broader multimodal capabilities, and reveal limitations in existing OH metrics that motivate more nuanced, visually grounded evaluations.
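The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the row-per-sample layout of $X_l^+$ and $X_l^-$, the reading of the subscript $l$ as a layer index, the sign alignment of the singular vector, and the scaling factor `alpha` are all assumptions made for illustration.

```python
import numpy as np


def llr(logp_with_image, logp_with_reference):
    """Per-token log-likelihood ratio:
    log P(y_t | y_<t, I, T) - log P(y_t | y_<t, I~, T).
    Inputs are arrays of per-token log-probabilities (assumed precomputed)."""
    return np.asarray(logp_with_image) - np.asarray(logp_with_reference)


def steering_direction(x_pos, x_neg):
    """Top right-singular vector of E = X+ - X-.

    x_pos, x_neg: arrays of shape (n_samples, hidden_dim), one row per
    paired sample (an assumed layout).
    """
    e = x_pos - x_neg
    # Economy-size SVD: e = U @ diag(S) @ Vt, singular values in descending order.
    _, _, vt = np.linalg.svd(e, full_matrices=False)
    v = vt[0]
    # Singular vectors are defined only up to sign; as an assumption, align
    # the direction with the mean positive-minus-negative difference.
    if np.dot(e.mean(axis=0), v) < 0:
        v = -v
    return v


def steer(x, v_edit, alpha=1.0):
    """Apply x <- x + alpha * v_edit (alpha is a hypothetical strength knob;
    alpha = 1 matches the update written in the text)."""
    return x + alpha * v_edit


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_neg = rng.normal(size=(16, 8))
    x_pos = x_neg + rng.normal(size=(16, 8)) * 0.1 + 0.5
    v = steering_direction(x_pos, x_neg)
    hidden = rng.normal(size=8)
    print(steer(hidden, v).shape)
```

Tokens with a large positive LLR are those whose probability depends most on the original image, which is the sense in which the text calls them visually grounded.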
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling the dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe generates visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is both a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: https://github.com/Ziwei-Zheng/VaLSe.
