AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting

Jiacheng Zhang, Feng Liu, Chao Du, Tianyu Pang

TL;DR

AdaVBoost is a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. It introduces Visual Grounding Entropy (VGE) to estimate hallucination risk, leveraging visual grounding as a complementary signal to capture evidence mismatches beyond entropy.

Abstract

Visual attention boosting has emerged as a promising direction for mitigating hallucinations in Large Vision-Language Models (LVLMs), where existing methods primarily focus on where to boost by applying a predefined scaling to the attention of method-specific visual tokens during autoregressive generation. In this paper, we identify a fundamental trade-off in these methods: a predefined scaling factor can be too weak at some generation steps, leaving hallucinations unresolved, yet too strong at others, leading to new hallucinations. Motivated by this finding, we propose AdaVBoost, a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. Specifically, we introduce Visual Grounding Entropy (VGE) to estimate hallucination risk, which leverages visual grounding as a complementary signal to capture evidence mismatches beyond entropy. Guided by VGE, AdaVBoost applies stronger visual attention boosting to high-risk tokens and weaker boosting to low-risk tokens, enabling token-level adaptive intervention at each generation step. Extensive experiments show that AdaVBoost significantly outperforms baseline methods across multiple LVLMs and hallucination benchmarks.
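
The contrast between a predefined scaling factor and a risk-dependent one can be illustrated with a minimal sketch. It is not the authors' implementation: the VGE formula is not given in this section, so `risk` is a placeholder score assumed to lie in [0, 1], and the linear mapping `risk_to_scale`, its bounds, and the choice to rescale post-softmax attention weights are all hypothetical simplifications.

```python
from typing import List


def boost_visual_attention(
    attn: List[float], is_visual: List[bool], scale: float
) -> List[float]:
    """Multiply the attention weights on visual tokens by `scale`,
    then renormalize so the weights sum to 1 again."""
    if len(attn) != len(is_visual):
        raise ValueError("attn and is_visual must have the same length")
    boosted = [a * scale if v else a for a, v in zip(attn, is_visual)]
    total = sum(boosted)
    return [b / total for b in boosted]


def risk_to_scale(
    risk: float, min_scale: float = 1.0, max_scale: float = 2.0
) -> float:
    """Hypothetical linear map from a risk score in [0, 1] to a scaling
    factor: low-risk tokens get little boost, high-risk tokens get more."""
    risk = min(max(risk, 0.0), 1.0)  # clamp to [0, 1]
    return min_scale + risk * (max_scale - min_scale)


def visual_mass(attn: List[float], is_visual: List[bool]) -> float:
    """Total attention weight placed on visual tokens."""
    return sum(a for a, v in zip(attn, is_visual) if v)


# Toy attention row for one generation step: two visual tokens, two text tokens.
attn = [0.1, 0.2, 0.3, 0.4]
is_visual = [True, True, False, False]

# A uniform method uses the same factor (e.g., 1.2) at every step.
uniform = boost_visual_attention(attn, is_visual, 1.2)

# An adaptive method picks the factor per step from the estimated risk.
low_risk = boost_visual_attention(attn, is_visual, risk_to_scale(0.0))
high_risk = boost_visual_attention(attn, is_visual, risk_to_scale(1.0))
```

In this toy setup the attention mass on visual tokens is unchanged for the low-risk step and grows for the high-risk step, which is the qualitative behavior the abstract describes.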

Paper Structure

This paper contains 30 sections, 16 equations, 9 figures, 11 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Proof-of-concept experiments. We randomly sample 200 examples from the AMBER benchmark [Wang2023AnLM] and use GPT-5-mini [openai2025gpt5mini] as a judge to determine whether a token is hallucinated. The full prompts and detailed justifications are provided in Appendix A (definition of hallucination). Following the experimental setting of [Liu2024PayingMA], we set the scaling factor to 1.2. All experiments are conducted on LLaVA-NeXT-7B [liu2024llavanext]. (a) Effects of uniform visual attention boosting with a predefined scaling factor (i.e., 1.2) on hallucinated tokens. Of the 520 hallucinated tokens in the original responses across 200 images, 358 are corrected after boosting visual attention. However, 162 remain unresolved, indicating that the predefined scaling factor might be insufficient at some generation steps. Meanwhile, the same scaling factor can be overly aggressive for other tokens, introducing 302 over-boosted hallucinated tokens that are absent from the vanilla response. (b) Analysis of the 162 remaining hallucinated tokens. 135 tokens become non-hallucinatory with a stronger visual boost (i.e., a scaling factor larger than 1.2), indicating that they are under-boosted at the current scaling factor. Taken together, these observations highlight the need for token-specific visual interventions. (c) Investigation of unfixable tokens. For the remaining 27 tokens that visual boosting alone cannot resolve, we apply attention suppression on text input tokens as a complement to visual boosting. Notably, 22 tokens become non-hallucinatory, leaving only 5 hallucinated tokens.
  • Figure 2: Examples of under-boosted and over-boosted hallucinated tokens in LLaVA-NeXT, Qwen3-VL and InternVL3.5.
  • Figure 3: Left: An example where LLaVA-NeXT generates a hallucinated attribute (i.e., camera) with extremely low token entropy (i.e., extremely high confidence), indicating that using entropy alone sometimes fails to provide reliable risk signals. We use GPT-5-mini as a judge to decide if a token is hallucinated. This observation highlights a structural blind spot for confidence-based metrics (e.g., entropy) and motivates us to design a complementary signal beyond model confidence. Right: In the low-entropy region, we find that hallucinated tokens exhibit substantially weaker visual grounding (VG) scores than normal tokens, suggesting that VG can capture evidence mismatches that entropy fails to capture.
  • Figure 4: Comparison of entropy and VGE as hallucination risk estimators on LLaVA-NeXT. Each bar represents the number of hallucinated tokens within a signal quantile (Q1 = lowest, Q10 = highest). We use GPT-5-mini as a judge to decide if a token is hallucinated. Notably, VGE achieves a stronger correlation with the number of hallucinated tokens compared to entropy alone, demonstrating that incorporating visual grounding information helps identify hallucinated tokens that entropy misses.
  • Figure 5: Inference time comparison of different methods on the CHAIR benchmark using LLaVA-NeXT-7B.
  • ...and 4 more figures
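
The quantile analysis described in the Figure 4 caption (tokens grouped by signal value, hallucinated tokens counted per group) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' evaluation code: the function names, the rank-based equal-size binning, and the toy data are hypothetical, and the signal shown is plain token entropy because the VGE formula does not appear in this section.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def hallucinations_per_quantile(
    signal: Sequence[float], hallucinated: Sequence[bool], n_bins: int = 10
) -> List[int]:
    """Sort tokens by `signal`, split them into `n_bins` near-equal groups
    (index 0 = lowest signal), and count hallucinated tokens per group."""
    if len(signal) != len(hallucinated):
        raise ValueError("signal and hallucinated must have the same length")
    order = sorted(range(len(signal)), key=lambda i: signal[i])
    counts = [0] * n_bins
    n = len(order)
    for rank, idx in enumerate(order):
        bin_id = rank * n_bins // n  # rank-based bin index in [0, n_bins - 1]
        if hallucinated[idx]:
            counts[bin_id] += 1
    return counts


# Made-up toy data: 20 tokens with increasing signal, where the four
# highest-signal tokens are labeled as hallucinated by some judge.
toy_signal = [0.1 * i for i in range(20)]
toy_labels = [i >= 16 for i in range(20)]
toy_counts = hallucinations_per_quantile(toy_signal, toy_labels)
```

A risk signal that tracks hallucination well would concentrate the counts in the highest groups, as in this toy case; the toy numbers say nothing about the paper's actual results.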