GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity
Seongheon Park, Sharon Li
TL;DR
GLSim tackles object hallucination in LVLMs with a training-free detector that fuses global scene context with local visual grounding through embedding similarity. It computes a global score from the object embedding and the final instruction embedding, and a local score from the Top-K patches identified via a Visual Logit Lens, combining them with weight $w$ as $s_{GLSim}(o,\mathbf{x}) = w \cdot s_{global}(o,\mathbf{x}) + (1-w) \cdot s_{local}(o,\mathbf{x})$. On MSCOCO and Objects365, GLSim consistently outperforms state-of-the-art baselines across multiple LVLMs, with AUROC gains of up to 12.7 percentage points and robust AUPR improvements; extensive ablations show that the global and local signals are complementary. The approach enables efficient, real-time, self-evaluating hallucination detection without external models, contributing to safer and more trustworthy LVLM deployments in diverse real-world contexts.
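The weighted combination in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cosine similarity as the embedding similarity, assumes the local score is the mean similarity over the Top-K patches, and takes the per-patch relevance values (which the paper obtains from a Visual Logit Lens) as a precomputed input. The names `glsim_score`, `patch_relevance`, `k`, and `w` are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def glsim_score(obj_emb, instr_emb, patch_embs, patch_relevance, k=2, w=0.5):
    """Sketch of s_GLSim = w * s_global + (1 - w) * s_local.

    Assumptions (not confirmed by the source):
      - similarity is cosine similarity;
      - s_local averages the similarity over the k most relevant patches;
      - patch_relevance is supplied by the caller (e.g. from a logit lens).
    """
    # Global signal: object embedding vs. final instruction embedding.
    s_global = cosine(obj_emb, instr_emb)
    # Local signal: pick the k patches with the highest relevance.
    top_idx = np.argsort(patch_relevance)[-k:]
    s_local = float(np.mean([cosine(obj_emb, patch_embs[i]) for i in top_idx]))
    return w * s_global + (1 - w) * s_local

# Toy example with 2-D embeddings and three image patches.
obj = np.array([1.0, 0.0])
instr = np.array([0.0, 1.0])            # orthogonal to the object embedding
patches = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
relevance = np.array([0.9, 0.1, 0.8])   # patches 0 and 2 are the Top-2
score = glsim_score(obj, instr, patches, relevance, k=2, w=0.5)
```

In this toy setup a higher score means the object is better supported by the image; a threshold on the score would then separate real from hallucinated objects, with `w` trading off the global and local terms.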
Abstract
Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.
