HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, Jiawei Zhou
TL;DR
This work tackles object hallucination (OH) in large vision-language models (LVLMs) by introducing HALC, a plug-and-play decoding algorithm that operates at both local and global levels. Locally, HALC uses adaptive focal-contrast grounding to identify the optimal visual context for each token and reweight the token logits accordingly; globally, it uses a matching-based beam search guided by visual-text similarity to select faithful, high-quality outputs. The approach is backed by theoretical guarantees on the robustness of its field-of-view (FOV) sampling and by extensive experiments on CHAIR, POPE, MME, and LLaVA-Bench, which show state-of-the-art OH reduction with minimal loss in text quality. HALC also provides an open-source platform for evaluating OH-reduction methods across LVLM backbones, facilitating broader adoption and comparison.
Abstract
While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLM as a plug-and-play module without extra training. Extensive experimental studies demonstrate the effectiveness of HALC in reducing OH, outperforming state-of-the-art methods across four benchmarks.
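
To make the two levels more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the local step can be approximated by the generic contrastive-decoding rule (1 + alpha) * a - alpha * b applied to next-token logits obtained under two different visual contexts, and that the global step picks the candidate with the highest visual-text similarity. The function names, the alpha parameter, the toy logits, and the toy similarity scores are all hypothetical; the exact reweighting and scoring rules used by HALC are not given in this summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_logits(logits_focus, logits_other, alpha=1.0):
    """Generic contrastive reweighting (an assumption, not HALC's exact rule).

    Tokens that the focused visual context supports more strongly than the
    other context are amplified; tokens it supports less are suppressed.
    """
    return [(1 + alpha) * a - alpha * b
            for a, b in zip(logits_focus, logits_other)]


def select_candidate(candidates, similarity):
    """Global step sketch: keep the candidate text that best matches the image.

    `similarity` stands in for an image-text matching model; here it is any
    callable mapping a candidate string to a score.
    """
    return max(candidates, key=similarity)


# Toy example with a three-word vocabulary (all values are made up).
vocab = ["dog", "cat", "frisbee"]
logits_focus = [1.0, 0.5, 1.2]   # logits under a tightly focused visual context
logits_other = [1.0, 1.5, 0.2]   # logits under a different visual context

adjusted = contrast_logits(logits_focus, logits_other, alpha=1.0)
probs = softmax(adjusted)
best_token = vocab[probs.index(max(probs))]

# Toy similarity scores standing in for a visual-text matching model.
toy_scores = {"a dog catches a frisbee": 0.81, "a cat sits on a sofa": 0.12}
best_text = select_candidate(list(toy_scores), toy_scores.get)
```

In this toy setting the contrast favors the token that the focused context supports more than the other context, and the candidate with the higher similarity score is kept. A real system would obtain the logits from an LVLM and the similarity from a pretrained image-text model.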
