Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs
Anirudh Phukan, Divyansh, Harshit Kumar Morj, Vaishnavi, Apoorv Saxena, Koustava Goswami
TL;DR
This work targets hallucinations in Large Multimodal Models (LMMs) by moving beyond the logit lens and introducing ContextualLens, which uses contextual embeddings from middle layers to detect hallucinations and ground answers. The method scores visual support by computing the cosine similarity between answer-token embeddings and image-patch embeddings, enabling both hallucination detection and two grounding pathways, including a high-precision bounding-box approach. Evaluations on HQH for detection and on TextVQA-X and VizWiz-G for grounding show that ContextualLens outperforms the logit lens and often surpasses output-probability baselines while remaining training-free. The results indicate improved reliability and interpretability for multimodal models, with practical benefits for multimodal attribution and grounded reasoning. The study also discusses robustness to layer selection, presents qualitative grounding examples, and outlines future directions for broadening applicability and improving recall on more complex tasks.
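The scoring idea in this summary can be illustrated with a minimal sketch. It assumes the answer-token and image-patch embeddings have already been extracted from a middle layer as plain arrays. The function names, the toy vectors, and the aggregation rule (best-matching patch per answer token, averaged over tokens) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(answer_emb, patch_emb, eps=1e-8):
    """Pairwise cosine similarity between rows of two 2-D arrays.

    answer_emb: (num_answer_tokens, dim)
    patch_emb:  (num_patches, dim)
    Returns an array of shape (num_answer_tokens, num_patches).
    """
    a = answer_emb / (np.linalg.norm(answer_emb, axis=1, keepdims=True) + eps)
    p = patch_emb / (np.linalg.norm(patch_emb, axis=1, keepdims=True) + eps)
    return a @ p.T


def visual_support_score(answer_emb, patch_emb):
    """Hypothetical aggregation: for each answer token take its best-matching
    patch, then average over tokens. A low score suggests weak visual support."""
    sim = cosine_similarity_matrix(answer_emb, patch_emb)
    return float(sim.max(axis=1).mean())


# Toy data standing in for real middle-layer embeddings (assumed shapes).
patches = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
supported_answer = np.array([[2.0, 0.0, 0.0]])    # aligned with patch 0
unsupported_answer = np.array([[0.0, 0.0, 3.0]])  # orthogonal to every patch

supported_score = visual_support_score(supported_answer, patches)
unsupported_score = visual_support_score(unsupported_answer, patches)
```

In this toy setup the aligned answer scores close to 1 and the orthogonal answer scores 0. A detector built this way would flag answers whose score falls below a threshold chosen on held-out data.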
Abstract
The rapid development of Large Multimodal Models (LMMs) has significantly advanced multimodal understanding by harnessing the language abilities of Large Language Models (LLMs) and integrating modality-specific encoders. However, LMMs are plagued by hallucinations that limit their reliability and adoption. While traditional methods to detect and mitigate these hallucinations often involve costly training or rely heavily on external models, recent approaches utilizing internal model features present a promising alternative. In this paper, we critically assess the limitations of the state-of-the-art training-free technique, the logit lens, in handling generalized visual hallucinations. We introduce ContextualLens, a refined method that leverages contextual token embeddings from middle layers of LMMs. This approach significantly improves hallucination detection and grounding across diverse categories, including actions and OCR, while also excelling in tasks requiring contextual understanding, such as spatial relations and attribute comparison. Our novel grounding technique yields highly precise bounding boxes, facilitating a transition from Zero-Shot Object Segmentation to Grounded Visual Question Answering. Our contributions pave the way for more reliable and interpretable multimodal models.
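The abstract does not spell out how bounding boxes are derived from the embeddings. Purely as an illustration, the sketch below shows one simple way that patch-level similarity scores could be turned into a box. It assumes a row-major patch grid, square patches of a known pixel size, and a relative threshold; the function name and every parameter are hypothetical and are not the paper's procedure.

```python
import numpy as np


def box_from_patch_scores(scores, grid_hw, patch_px, rel_threshold=0.8):
    """Return (x0, y0, x1, y1) in pixels enclosing the high-scoring patches.

    scores:        flat sequence of per-patch scores, row-major (assumed layout)
    grid_hw:       (rows, cols) of the patch grid
    patch_px:      side length of one square patch in pixels (assumed)
    rel_threshold: fraction of the score range a patch must reach to be kept
    """
    rows_n, cols_n = grid_hw
    grid = np.asarray(scores, dtype=float).reshape(rows_n, cols_n)

    # Keep patches whose score lies in the top part of the observed range.
    lo, hi = grid.min(), grid.max()
    cutoff = lo + rel_threshold * (hi - lo)
    rows, cols = np.nonzero(grid >= cutoff)

    # Tightest axis-aligned box around the kept patches, in pixel units.
    x0, y0 = cols.min() * patch_px, rows.min() * patch_px
    x1, y1 = (cols.max() + 1) * patch_px, (rows.max() + 1) * patch_px
    return int(x0), int(y0), int(x1), int(y1)


# Toy 3x4 grid of scores with two adjacent high-scoring patches.
toy_scores = np.zeros((3, 4))
toy_scores[1, 2] = 1.0
toy_scores[1, 3] = 0.9
toy_box = box_from_patch_scores(toy_scores.ravel(), (3, 4), patch_px=14)
```

A stricter `rel_threshold` favors precision, giving tighter boxes around the strongest evidence, at the cost of recall.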
