Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Spencer Whitehead, Jacob Phillips, Sean Hendryx
TL;DR
The paper addresses the reliability challenge of multimodal language models by reframing hallucination detection as end-to-end sequence labeling that localizes hallucinated text without relying on predefined spans. It introduces corrupted grounding data, generated by masking grounded spans and filling them with hallucinated phrases from a text-only LM, to pre-train detectors and improve sample efficiency during fine-tuning. Empirical results on M-HalDetect show that pre-training on corrupted grounding data improves performance in low-data regimes across model scales, and that the grounding annotations provide a meaningful learning signal. The approach offers a scalable path to detecting and localizing multimodal hallucinations, supporting downstream filtering and alignment strategies while highlighting data quality and distribution considerations.
Abstract
Multimodal language models can exhibit hallucinations in their outputs, which limits their reliability. The ability to automatically detect these errors is important for mitigating them, but it has been less explored, and existing efforts do not localize hallucinations, instead framing detection as a classification task. In this work, we first pose multimodal hallucination detection as a sequence labeling task in which models must localize hallucinated text spans, and we present a strong baseline model. Given the high cost of human annotations for this task, we propose an approach to improve the sample efficiency of these models by creating corrupted grounding data, which we use for pre-training. Leveraging phrase grounding data, we generate hallucinations to replace grounded spans and create hallucinated text. Experiments show that pre-training on this data improves sample efficiency when fine-tuning, and that the learning signal from the grounding data plays an important role in these improvements.
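To make the data construction concrete, the following is a minimal sketch of how grounded spans might be replaced and turned into sequence labels. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`corrupt_caption`, `token_labels`, `toy_fill`), the character-offset span format, the `[MASK]` placeholder, the `corrupt_prob` parameter, and the whitespace tokenization are all hypothetical. In particular, `toy_fill` is a lookup-table stand-in for the text-only LM that would generate the replacement phrases.

```python
import random
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # character offsets [start, end) of a phrase


def corrupt_caption(
    caption: str,
    grounded_spans: List[Span],
    fill_fn: Callable[[str, str], str],
    corrupt_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[Span]]:
    """Replace a random subset of grounded spans with generated phrases.

    Assumes the grounded spans do not overlap. Returns the corrupted
    caption and the character spans of the inserted phrases.
    """
    rng = rng or random.Random(0)
    pieces: List[str] = []
    hallucinated: List[Span] = []
    cursor = 0   # position in the original caption
    length = 0   # length of the corrupted caption built so far
    for start, end in sorted(grounded_spans):
        pieces.append(caption[cursor:start])
        length += start - cursor
        original = caption[start:end]
        if rng.random() < corrupt_prob:
            # Mask the grounded phrase and ask the filler for a replacement.
            masked = caption[:start] + "[MASK]" + caption[end:]
            phrase = fill_fn(masked, original)
            hallucinated.append((length, length + len(phrase)))
        else:
            phrase = original
        pieces.append(phrase)
        length += len(phrase)
        cursor = end
    pieces.append(caption[cursor:])
    return "".join(pieces), hallucinated


def token_labels(text: str, hallucinated: List[Span]) -> List[Tuple[str, int]]:
    """Whitespace-tokenize and label a token 1 if it overlaps a replaced span."""
    labeled = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        pos = end
        label = int(any(s < end and start < e for s, e in hallucinated))
        labeled.append((tok, label))
    return labeled


def toy_fill(masked_context: str, original: str) -> str:
    """Hypothetical stand-in for a text-only LM: a fixed lookup table."""
    table = {"A dog": "A cat", "a red couch": "a wooden bench"}
    return table.get(original, original)


if __name__ == "__main__":
    text, spans = corrupt_caption(
        "A dog sits on a red couch", [(0, 5), (14, 25)], toy_fill, corrupt_prob=1.0
    )
    print(text)
    print(token_labels(text, spans))
```

With `corrupt_prob=1.0`, both grounded phrases are replaced and every token inside a replacement receives label 1, which is the kind of span-level supervision a sequence labeling detector could be pre-trained on. A real setup would use the model's own tokenizer and an actual language model in place of `toy_fill`.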
