Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language Models

Jiahe Chen, Jiaying He, Qiyuan Chen, Qian Shao, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu

TL;DR

Dynamic Logits Calibration (DLC) is a training-free, token-level grounding mechanism for large vision-language models (LVLMs) that acts as a real-time visual referee during decoding. It computes two visual alignment scores (intrinsic relevance and contextual coherence) relative to a dynamic baseline, combines them into a Relative Visual Advantage (RVA), and adaptively modulates the logits with a non-linear strength $\lambda_t$ to favor visually grounded tokens. Across several LVLMs, including 13B-scale models, and across evaluations (CHAIR, POPE, SHR, GPT-4o-based evaluation, MME, LLaVA-Bench), DLC reduces hallucinations while preserving inference speed and remaining compatible with various decoding strategies. The approach is practical and scalable, and code will be released to facilitate adoption in trustworthy multimodal systems.

Abstract

Large Vision-Language Models (LVLMs) face a tug-of-war between powerful linguistic priors and visual evidence, often leading to "semantic drift": the progressive detachment from visual input that we identify as the root cause of hallucination. While several existing training-free decoding strategies have achieved considerable success, they still suffer from inherent limitations. Many are computationally prohibitive, requiring multiple forward passes through the entire LVLM, while others rely on indirect, heuristic-based proxies that are unreliable correlates of a direct semantic conflict. We propose Dynamic Logits Calibration (DLC), a novel training-free framework that is the first to cure semantic drift in a direct, dynamic, and efficient manner. At each decoding step, DLC introduces a real-time visual referee that performs a dual-aspect visual alignment check: it assesses (1) the intrinsic visual relevance of a candidate token and (2) its contextual visual coherence. By dynamically balancing these two checks and evaluating them against an adaptive baseline, DLC surgically modulates the output logits to favor grounded tokens. Extensive experiments show that DLC significantly outperforms existing methods in mitigating hallucinations while, crucially, maintaining high inference efficiency by avoiding costly multiple LVLM forward passes. Our work presents a powerful and practical solution for building more reliable and visually grounded LVLMs. Code will be released at https://github.com/JiaheChen2002/DLC.
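
The calibration step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the blending weight `alpha`, the baseline-normalized form of the Relative Visual Advantage, the `tanh` rule for the strength $\lambda_t$, and the parameters `lam_max` and `gamma` are all assumptions chosen for illustration. The alignment scores are passed in as plain numbers; how they are computed from the image is outside the scope of this sketch.

```python
import math

def calibrate_logits(logits, ita, ccta, baseline, alpha=0.5, lam_max=3.0, gamma=5.0):
    """Illustrative logit calibration over top-k candidates (assumed formulas, not the paper's)."""
    # Blend intrinsic relevance (ita) and contextual coherence (ccta) per candidate.
    blended = [alpha * i + (1.0 - alpha) * c for i, c in zip(ita, ccta)]
    # Assumed Relative Visual Advantage: signed margin over the baseline, scaled by the baseline.
    rva = [(b - baseline) / (abs(baseline) + 1e-8) for b in blended]
    # Assumed strength rule: intervene only when the current top candidate falls below the baseline.
    top = max(range(len(logits)), key=lambda i: logits[i])
    deficit = max(0.0, baseline - blended[top])
    lam = lam_max * math.tanh(gamma * deficit)
    # Shift each logit in proportion to its visual advantage.
    return [l + lam * r for l, r in zip(logits, rva)]

logits = [2.0, 1.5, 0.5]
# Top candidate is poorly aligned with the image: calibration should demote it.
drifting = calibrate_logits(logits, ita=[0.10, 0.30, 0.20], ccta=[0.12, 0.32, 0.18], baseline=0.25)
# Top candidate is already well aligned: the strength is zero and logits pass through.
grounded = calibrate_logits(logits, ita=[0.40, 0.30, 0.20], ccta=[0.40, 0.30, 0.20], baseline=0.25)
```

In the first call the second candidate overtakes the original top candidate because its blended alignment exceeds the baseline; in the second call the logits are returned unchanged. Only the top-k candidates are rescored, which is consistent with the claim that no extra LVLM forward pass is needed.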

Paper Structure

This paper contains 27 sections, 10 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of decoding methods across three dimensions (Speed, Correctness, and Detailedness) using LLaVA-1.5, with all experiments conducted on a single NVIDIA A100 GPU. The vertical axis represents correctness, while the horizontal axis shows tokens per second (TPS). Bubble size corresponds to the Detailedness of responses.
  • Figure 2: Visualization of semantic drift and token selection in LVLMs. The central plot tracks CCTA against a Historical Baseline $\bar{B}_{t}$, contrasting visually faithful (green) and hallucination (red) phases. The surrounding bar charts provide snapshots of token selection.
  • Figure 3: Overview of Dynamic Logits Calibration (DLC). Given an input image and prompt, DLC first performs real-time visual alignment assessment on the top-k candidate tokens by calculating CCTA and ITA scores relative to a historical baseline $\bar{B}_{t}$. These scores inform the adaptive logit modulation step, which computes RVA and $\lambda_{t}$ to adjust the original logits ($L_{t,i}$), favoring visually grounded tokens.
  • Figure 4: SHR evaluation results. Six aspects are analyzed: the number of sentences per image (SPI), tokens per second (TPS), the number of hallucinated sentences per image (HSPI), the number of hallucinated words per image (HWPI), the ratio of hallucinated sentences (HSR), and the ratio of hallucinated words (HWR). A larger radar area indicates better performance.
  • Figure 5: Evaluation on the subset of the MME benchmark.
  • ...and 7 more figures