SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding

Zhaoxu Li, Chenqi Kong, Peijun Bao, Song Xia, Yi Tu, Yi Yu, Xinghao Jiang, Xudong Jiang

TL;DR

Stability-Aware Knowledge-Enhanced Decoding (SAKED) introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model, and achieves state-of-the-art hallucination mitigation across various models, tasks, and benchmarks.

Abstract

Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.

Paper Structure (30 sections, 14 equations, 10 figures, 10 tables, 1 algorithm)

This paper contains 30 sections, 14 equations, 10 figures, 10 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a) SAKED mines internal stability-aware knowledge to enhance the decoding process. (b) SAKED consistently achieves strong hallucination-mitigation performance on CHAIR, POPE, MME, and AMBER.
  • Figure 3: (a) CHSS distribution across model layers. (b) Token probability distributions across model layers. For clarity, we show only the three highest probability tokens in the final layer: "dogs", "small", and "statues".
  • Figure 4: (a) Probability distributions of the generated token "people" across model layers (due to space limitations, we show the four highest-probability tokens in the final layer: "passengers", "people", "bus", and "ben"); (b) JSD between the logit distributions of each layer and its adjacent layer, where lower JSD indicates a more stable flow of knowledge.
  • Figure 5: VFD distributions of visual attention between each token and its adjacent token across model layers. The grounded token "bus" and the hallucinated token "people" are highlighted in green and red, respectively. Compared to grounded tokens, the hallucinated token exhibits substantial visual focus distraction during decoding.
  • Figure 6: Detailed MME evaluation results on 10 subsets: Existence, Count, Position, Color, Poster, Celebrity, Scene, Landmark, Artwork, and OCR on (a) Qwen2.5VL, (b) InternVL3, (c) LLaVA1.5, and (d) Average results on the three models.
  • ...and 5 more figures
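
Figure 4(b) measures how smoothly knowledge flows through the model by taking the Jensen-Shannon divergence (JSD) between each layer's distribution and that of its adjacent layer, with lower JSD indicating a more stable flow. The sketch below illustrates only that measurement on toy data. It is not the paper's KSS or the SAKED decoding rule, neither of which is defined in this summary. The function names, the softmax step that turns per-layer logits into probabilities, and the random toy logits are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between distributions p and q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def adjacent_layer_jsd(layer_logits):
    """JSD between every layer and the next one.

    layer_logits: array of shape (num_layers, vocab_size), assumed to hold
    the next-token logits obtained from each layer (hypothetical input).
    Returns an array of shape (num_layers - 1,).
    """
    probs = softmax(np.asarray(layer_logits, dtype=np.float64))
    return jsd(probs[:-1], probs[1:])


# Toy data only: 6 "layers" over a 10-token vocabulary.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(6, 10))
toy_logits[4] = toy_logits[3]  # make layers 3 and 4 agree exactly

scores = adjacent_layer_jsd(toy_logits)
for i, s in enumerate(scores):
    print(f"layers {i}->{i + 1}: JSD = {s:.4f}")
```

In the toy run, the pair of layers that were forced to agree has a JSD of zero, while the unrelated random layers have larger values. With natural logarithms, JSD is bounded above by ln 2.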