GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

Seongheon Park, Sharon Li

TL;DR

GLSim tackles object hallucination in LVLMs with a training-free detector that fuses global scene context with local visual grounding through embedding similarity. It computes a global score from the object embedding and the final instruction embedding, and a local score from the Top-K image patches identified via a Visual Logit Lens, combining them with a weighting parameter $w$ as $s_{\text{GLSim}}(o,\mathbf{x}) = w \cdot s_{\text{global}}(o,\mathbf{x}) + (1-w) \cdot s_{\text{local}}(o,\mathbf{x})$. On MSCOCO and Objects365, GLSim consistently outperforms state-of-the-art baselines across multiple LVLMs, with AUROC gains of up to 12.7 percentage points and consistent AUPR improvements; extensive ablations show that the global and local signals are complementary. Because it requires no external models, the approach enables efficient, real-time, self-evaluating hallucination detection, contributing to safer and more trustworthy LVLM deployments in diverse real-world contexts.
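The scoring rule described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the use of cosine similarity, the averaging over the Top-K patches, the softmax-based patch ranking, and every name here (`glsim_score`, `is_hallucinated`, `unembed`, `obj_token_id`) are assumptions made for illustration. Only the weighted combination and the rule that a score below a threshold $\tau$ flags a hallucination come from the text.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def glsim_score(obj_emb, instr_emb, patch_embs, unembed, obj_token_id, k=2, w=0.5):
    """Hypothetical global-local similarity score for one object mention.

    obj_emb:      (d,)   embedding of the object token
    instr_emb:    (d,)   embedding of the final instruction token
    patch_embs:   (N, d) latent embeddings of the image patches
    unembed:      (V, d) unembedding matrix used for the logit lens
    obj_token_id: vocabulary index of the object token
    """
    # Global signal: similarity between the object and the instruction embedding.
    s_global = cosine(obj_emb, instr_emb)

    # Logit-lens step: project each patch to vocabulary logits, then rank
    # patches by the softmax probability they assign to the object token.
    logits = patch_embs @ unembed.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    top_k = np.argsort(-probs[:, obj_token_id])[:k]

    # Local signal (assumed form): mean similarity to the Top-K patches.
    s_local = float(np.mean([cosine(obj_emb, patch_embs[i]) for i in top_k]))

    # Weighted combination from the TL;DR: w * global + (1 - w) * local.
    score = w * s_global + (1.0 - w) * s_local
    return score, s_global, s_local, top_k


def is_hallucinated(score, tau):
    """An object whose score falls below the threshold tau is flagged."""
    return score < tau


# Toy inputs (made up): 4 patches, 3-dimensional embeddings, 3-token vocabulary.
patches = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.9, 0.1, 0.0]])
score, s_global, s_local, top_k = glsim_score(
    obj_emb=np.array([1.0, 0.0, 0.0]),
    instr_emb=np.array([1.0, 1.0, 0.0]),
    patch_embs=patches,
    unembed=np.eye(3),
    obj_token_id=0,
)
print(score, top_k, is_hallucinated(score, tau=0.9))
```

In this toy setup the two patches most aligned with the object token are selected, so the local score is high while the global score is moderate; setting `w` closer to 0 or 1 shifts the decision toward one signal or the other.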

Abstract

Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.

Paper Structure

This paper contains 44 sections, 13 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Overall framework. (a) We detect object-level hallucinations by leveraging latent embedding similarity. (b) For each object, the most relevant image regions are identified via unembedding from latent image representations. (c) The final GLSim score is computed as a weighted combination of local (\ref{sec:local}) and global (\ref{sec:global}) signals, capturing both scene-level plausibility and spatial alignment and thereby improving object hallucination detection accuracy.
  • Figure 2: Qualitative evidence. In the generated descriptions, hallucinated objects are highlighted in red. The localized image regions are shaded with the same color as their corresponding objects. The gray line shows a threshold value $\tau$. If an object’s score is lower than the threshold $\tau$, we consider it a hallucination. In (a), the local score successfully compensates for the failure of the global score, while in (b), the global score offsets the limitations of the local score.
  • Figure 3: Internal Confidence (IC) can assign high confidence to incorrect regions for hallucinated objects. Our local score (\ref{sec:local}) mitigates this by using cross-modal embedding similarity.
  • Figure 4: Object grounding results with LLaVA. Ground-truth bounding boxes are shown in red.
  • Figure 5: (a) Effect of the number of selected image patches $K$; (b) effect of the weighting parameter $w$ in \ref{equ:glsim}; and (c) effect of the text embedding layer index $l'$.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Definition 3.1: Object Hallucination Detector