Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language Models
Jiahe Chen, Jiaying He, Qiyuan Chen, Qian Shao, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu
TL;DR
Dynamic Logits Calibration (DLC) is a training-free, token-level grounding mechanism for large vision-language models (LVLMs) that acts as a real-time visual referee during decoding. It computes two visual alignment scores (intrinsic relevance and contextual coherence) relative to a dynamic baseline, combines them into a Relative Visual Advantage (RVA), and adaptively modulates the logits with a non-linear strength $\lambda_t$ to favor visually grounded tokens. Across several LVLMs, including 13B-scale models, and across benchmarks and evaluations (CHAIR, POPE, SHR, GPT-4o-assisted evaluation, MME, LLaVA-Bench), DLC reduces hallucinations while preserving inference speed and remaining compatible with various decoding strategies. The approach is practical and scalable, and the code will be released to facilitate adoption in trustworthy multimodal systems.
Abstract
Large Vision-Language Models (LVLMs) face a tug-of-war between powerful linguistic priors and visual evidence, often leading to "semantic drift": the progressive detachment from visual input that we identify as the root cause of hallucination. While several existing training-free decoding strategies have achieved considerable success, they still suffer from inherent limitations. Many are computationally prohibitive, requiring multiple forward passes through the entire LVLM, while others rely on indirect, heuristic proxies that correlate only unreliably with the underlying semantic conflict. We propose Dynamic Logits Calibration (DLC), a novel training-free framework that is the first to cure semantic drift in a direct, dynamic, and efficient manner. At each decoding step, DLC introduces a real-time visual referee that performs a dual-aspect visual alignment check, assessing (1) the intrinsic visual relevance of a candidate token and (2) its contextual visual coherence. By dynamically balancing these two checks and evaluating them against an adaptive baseline, DLC surgically modulates the output logits to favor grounded tokens. Extensive experiments show that DLC significantly outperforms existing methods in mitigating hallucinations while, crucially, maintaining high inference efficiency by avoiding costly multiple LVLM forward passes. Our work presents a powerful and practical solution for building more reliable and visually grounded LVLMs. Code will be released at https://github.com/JiaheChen2002/DLC.
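
As a rough illustration of the per-step calibration described above, the following Python sketch shifts candidate logits according to a Relative Visual Advantage. It is not the released implementation: the abstract gives no formulas, so the linear blend of the two scores, the tanh-based strength schedule, and every name here (`adaptive_strength`, `calibrate_logits`, `balance`, `max_strength`) are assumptions made for illustration. The alignment scores are taken as precomputed inputs, for example similarities from a separate vision-text encoder.

```python
import math


def adaptive_strength(baseline, max_strength=2.0):
    """Hypothetical stand-in for the non-linear strength lambda_t.

    Assumption: intervene more strongly when the running context is
    poorly aligned with the image (low baseline score).
    """
    return max_strength * (1.0 - math.tanh(max(baseline, 0.0)))


def calibrate_logits(logits, intrinsic, contextual, baseline, balance=0.5):
    """Shift candidate logits toward visually grounded tokens.

    logits:     raw logits of the candidate tokens at this decoding step
    intrinsic:  per-candidate intrinsic visual relevance scores
    contextual: per-candidate contextual visual coherence scores
    baseline:   alignment score of the text generated so far
    balance:    assumed weight between the two checks (0..1)
    """
    strength = adaptive_strength(baseline)
    calibrated = []
    for logit, s_int, s_ctx in zip(logits, intrinsic, contextual):
        # Relative Visual Advantage: blended score measured against the baseline.
        rva = balance * (s_int - baseline) + (1.0 - balance) * (s_ctx - baseline)
        # Positive RVA raises the logit, negative RVA lowers it.
        calibrated.append(logit + strength * rva)
    return calibrated


# Two equally likely candidates; only the first is supported by the image.
adjusted = calibrate_logits(
    logits=[2.0, 2.0],
    intrinsic=[0.8, 0.2],
    contextual=[0.7, 0.3],
    baseline=0.5,
)
print(adjusted)
```

Because the scores are computed outside the LVLM and the adjustment is a single additive shift on the logits, a scheme of this shape needs no additional LVLM forward pass, which is consistent with the efficiency claim above.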
