HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding

Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, Jiawei Zhou

TL;DR

This work tackles object hallucination in vision-language models by introducing HALC, a plug-and-play decoding algorithm that operates at both local and global levels. Locally, HALC uses adaptive focal-contrast grounding to identify token-specific optimal visual contexts and reweight token logits; globally, it uses a matching-based beam search guided by visual-text similarity to select faithful, high-quality outputs. The approach is backed by theoretical guarantees on FOV sampling robustness and extensive experiments across CHAIR, POPE, MME, and LLaVA-Bench, showing state-of-the-art OH reduction with minimal loss in text quality. HALC also provides an open-source platform for evaluating OH-reduction methods across LVLM backbones, facilitating broader adoption and comparison.

Abstract

While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLM as a plug-and-play module without extra training. Extensive experimental studies demonstrate the effectiveness of HALC in reducing OH, outperforming state-of-the-art methods across four benchmarks.

Paper Structure

This paper contains 29 sections, 1 theorem, 21 equations, 8 figures, 14 tables, 1 algorithm.

Key Result

Theorem 5.1

Let $v^*=(w^*, h^*, p^*)$ be the optimal visual context. Assume there exists a tolerable neighborhood $\mathcal{B}(v^*, \epsilon)=\{\hat{v}: \|\hat{v} - v^*\|\leq \epsilon\}$ around $v^*$, such that decoding from visual contexts within the neighborhood is robust: where $D(\cdot,\cdot)\in [0, 1]$ is a symmetric discrepancy measure between two probability distributions, such as the Jensen-Shannon d

Figures (8)

  • Figure 1: On average, over 84.5% of the observed existence, attribute, and relationship hallucinations are reduced by leveraging some optimal visual context $v^*$. Blue bar denotes number of hallucinated tokens on each corresponding MME sub-task, while orange bar denotes results when decoding from the oracle $v^*$.
  • Figure 2: An overview of HALC. As the LVLM autoregressively generates text w.r.t. an image input (e.g. a man holding a clock on the beach), a conventional decoding method may hallucinate the clock as a surfboard. HALC corrects this potential hallucination by first locating its visual grounding $v_d$, then sampling $n$ distinctive yet overlapping FOVs (e.g. $\tilde{v}_s$, $\tilde{v}_d$, $\tilde{v}_l$). Next, all FOVs are fed back into the LVLM, along with the current ongoing response, yielding $n$ logit distributions. We then compute the Jensen-Shannon Divergence (JSD) between each pair of the $n$ distributions and select the top $m$ pairs, which provide $2m$ next-token candidates via bi-directionally contrasted logit distributions. Each of the $2m$ candidates is then appended to the $k$ ongoing beams (beam search omitted in the figure for simplicity), resulting in $2mk$ response candidates. Finally, the $k$ best responses are selected according to the global visual matching score between the current text and the original image, completing the current decoding round with the hallucinated token surfboard successfully corrected to clock.
  • Figure 3: Log-likelihood of object tokens w.r.t. visual context samples in the FOV space, at the generation step in the example of Fig. 2. Exponentially expanding FOVs are adopted. While obvious objects (e.g. beach, man) are stable with high likelihood, hallucinated objects are either noisy (e.g. book) or shift gradually with the context (e.g. surfboard). The victim token (e.g. clock) usually displays a drastically peaking pattern (local maximum).
  • Figure 4: Comparing four mainstream methods on the ratio of hallucinated objects ($\text{CHAIR}_I$) vs. the maximum number of tokens. The right axis (dashed line) indicates the total number of generated objects. HALC outperforms all other methods by maintaining a low hallucination ratio as the number of generated objects increases.
  • Figure 5: Comparison across OH baselines and SOTAs on four OH-critical MME subsets. All methods adopt MiniGPT-4 as the LVLM backbone. HALC outperforms all other methods by a large margin on average: existence: +10.7%; position: +18.3%; color: +19.4%; and count: +20.2%.
  • ...and 3 more figures
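
The decoding round described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`jsd`, `contrast`, `next_token_candidates`, `select_beams`) are hypothetical, the contrast rule assumes the common contrastive-decoding form $(1+\alpha)\log p_a - \alpha\log p_b$ (the paper's exact formula is not reproduced in this excerpt), and `score_fn` stands in for the global visual matching score, which in practice would come from an image-text similarity model.

```python
import math
from itertools import combinations


def jsd(p, q):
    """Jensen-Shannon divergence with log base 2, so the value lies in [0, 1]."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def contrast(p_pos, p_neg, alpha=1.0, eps=1e-12):
    """Token index maximizing (1 + alpha) * log p_pos - alpha * log p_neg.

    Assumed contrast rule for illustration only.
    """
    scores = [
        (1 + alpha) * math.log(a + eps) - alpha * math.log(b + eps)
        for a, b in zip(p_pos, p_neg)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def next_token_candidates(dists, m, alpha=1.0):
    """Pick the m FOV pairs with the largest JSD and contrast each pair in
    both directions, giving 2m next-token candidates."""
    pairs = sorted(
        combinations(range(len(dists)), 2),
        key=lambda ij: jsd(dists[ij[0]], dists[ij[1]]),
        reverse=True,
    )[:m]
    candidates = []
    for i, j in pairs:
        candidates.append(contrast(dists[i], dists[j], alpha))
        candidates.append(contrast(dists[j], dists[i], alpha))
    return candidates


def select_beams(beams, candidates, score_fn, k):
    """Append every candidate to every beam (2mk sequences in total) and keep
    the k sequences ranked highest by score_fn (a hypothetical matching score)."""
    expanded = [beam + [tok] for beam in beams for tok in candidates]
    return sorted(expanded, key=score_fn, reverse=True)[:k]


# Toy example: n = 3 FOVs over a 3-token vocabulary, m = 1, k = 2.
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
cands = next_token_candidates(dists, m=1)
beams = select_beams([[5], [6]], cands, lambda seq: seq[0] * 10 + seq[-1], k=2)
```

In the toy example, the lambda passed as `score_fn` is an arbitrary placeholder; swapping in a real image-text similarity scorer is what would tie beam selection to the input image.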

Theorems & Definitions (2)

  • Theorem 5.1
  • proof