HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou

TL;DR

The paper shows that hallucination risk is detectable before generation, that the most informative layer and modality vary across architectures, and that lightweight probes could enable early abstention, selective routing, and adaptive decoding, improving both safety and efficiency.

Abstract

Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
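To make the probing idea concrete, here is a minimal sketch of training a lightweight probe on pre-generation features and scoring it with AUROC. It rests on several assumptions that do not come from the paper: the probe is a plain logistic regression (the abstract says only "lightweight probes"), the features are synthetic Gaussian vectors standing in for real hidden states, and all function names (`train_probe`, `probe_score`, `auroc`, `make_example`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    # Probe output: estimated probability that the model will hallucinate.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(X, y, lr=0.5, epochs=200):
    # Assumption: logistic regression fit by full-batch gradient descent.
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = probe_score(w, b, x) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_example(rng, label, dim=8):
    # Synthetic stand-in for a hidden state: hallucinated examples (label 1)
    # are shifted along the first four dimensions.
    shift = 1.0 if label == 1 else 0.0
    return [rng.gauss(shift if j < 4 else 0.0, 1.0) for j in range(dim)]


rng = random.Random(0)
labels = [i % 2 for i in range(400)]
features = [make_example(rng, t) for t in labels]
X_train, y_train = features[:300], labels[:300]
X_test, y_test = features[300:], labels[300:]

w, b = train_probe(X_train, y_train)
test_scores = [probe_score(w, b, x) for x in X_test]
test_auroc = auroc(test_scores, y_test)
print(f"held-out AUROC: {test_auroc:.2f}")
```

In a real setting the synthetic vectors would be replaced by hidden states captured in a single forward pass over the image and prompt, and a thresholded probe score could then gate abstention or routing before any token is decoded.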

Paper Structure (67 sections, 1 equation, 7 figures, 14 tables)

Figures (7)

  • Figure 1: Pre-generation hallucination prediction scores on examples across 8 different VLM task domains (benchmarking data distribution in Fig. 3(a)) for Qwen2.5-VL-7B. The probe scores reflect how likely the probe thinks the model is to hallucinate, based on different pre-generation features: visual features (VF), vision-token states (VT), and query-token states (QT).
  • Figure 2: HALP detection pipeline. We extract three types of representations: (1) visual feature vectors from the encoder, (2) vision token states from decoder layers at the final vision patch position, and (3) query token states from the same layers at the final query position. These representations are used independently to probe for hallucination prediction prior to decoding.
  • Figure 3: Hallucination detection dataset distributions of task domains, answer formats, and hallucination question types.
  • Figure 4: AUROC scores for the final Vision Token (VT) representation at different decoder layers.
  • Figure 5: AUROC scores for the final Query Token (QT) representation at different decoder layers.
  • ...and 2 more figures
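
The extraction step described in the Figure 2 caption (taking decoder states at the final vision patch position and at the final query position) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `hidden_states[layer][position]` is a hypothetical nested list of per-layer decoder states, `token_types` is a hypothetical list tagging each prompt position as "vision" or "text", and the final query position is assumed to be the last prompt token.

```python
def last_index(token_types, kind):
    # Return the index of the last prompt position with the given type.
    for i in range(len(token_types) - 1, -1, -1):
        if token_types[i] == kind:
            return i
    raise ValueError(f"no token of type {kind!r} in the prompt")


def extract_probe_features(hidden_states, token_types):
    # VT: state at the final vision patch position, one vector per layer.
    # QT: state at the final query position (assumed: last prompt token).
    vt_pos = last_index(token_types, "vision")
    qt_pos = len(token_types) - 1
    return {
        "VT": [layer[vt_pos] for layer in hidden_states],
        "QT": [layer[qt_pos] for layer in hidden_states],
    }


# Toy prompt: three vision patches followed by two text tokens,
# two decoder layers, hidden size 2 (values chosen only to be traceable).
token_types = ["vision", "vision", "vision", "text", "text"]
hidden_states = [
    [[10 * layer + pos, -pos] for pos in range(len(token_types))]
    for layer in range(2)
]

feats = extract_probe_features(hidden_states, token_types)
print(feats["VT"])
print(feats["QT"])
```

Each per-layer vector would then be fed to its own probe, which is how layer-wise AUROC curves such as those in Figures 4 and 5 could be produced.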