KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing

Siyu Jiang, Feiyang Chen, Xiaojin Zhang, Kun He

TL;DR

KVSmooth is a training-free, plug-and-play method that mitigates hallucination in MLLMs by applying an exponential moving average to both keys and values in the KV-Cache, with the smoothing strength adapted per token according to the entropy of its attention distribution.

Abstract

Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination -- corresponding to the generation of visually inconsistent objects, attributes, or relations -- remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination ($\mathit{CHAIR}_{S}$ from $41.8 \rightarrow 18.2$) while improving overall performance ($F_1$ score from $77.5 \rightarrow 79.2$), achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach.
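
The smoothing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how entropy is mapped to a smoothing coefficient, so the `smoothing_coeff` function (normalized entropy scaled by a reference strength `lam_ref`), the EMA convention, and all function names are assumptions made for this example.

```python
import numpy as np

def row_entropy(attn_row, eps=1e-12):
    """Shannon entropy of one attention row (renormalized to sum to 1)."""
    p = np.asarray(attn_row, dtype=np.float64)
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum())

def smoothing_coeff(attn_row, lam_ref=0.5):
    """ASSUMED mapping: entropy normalized by log(n), scaled by lam_ref.

    Higher entropy gives a larger coefficient, i.e. stronger smoothing.
    The paper's actual mapping may differ.
    """
    n = len(attn_row)
    if n < 2:
        return 0.0
    h_norm = row_entropy(attn_row) / np.log(n)
    return float(np.clip(lam_ref * h_norm, 0.0, 1.0))

def ema_update(prev_smoothed, new_vec, lam):
    """One EMA step: lam weights the history, (1 - lam) the new vector."""
    return lam * prev_smoothed + (1.0 - lam) * new_vec

def smooth_kv(keys, values, attn_rows, lam_ref=0.5):
    """Apply entropy-guided EMA along the sequence axis of keys and values.

    keys, values: arrays of shape (seq_len, dim).
    attn_rows: one attention row per token (row t has t + 1 entries).
    """
    keys = np.asarray(keys, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    sk, sv = keys.copy(), values.copy()
    for t in range(1, len(keys)):
        lam = smoothing_coeff(attn_rows[t], lam_ref)
        sk[t] = ema_update(sk[t - 1], keys[t], lam)
        sv[t] = ema_update(sv[t - 1], values[t], lam)
    return sk, sv
```

Under this assumed mapping, a token whose attention is sharply peaked (entropy near zero) keeps its original key and value, while a token with diffuse attention is pulled toward the running average of its predecessors.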

Paper Structure (34 sections, 37 equations, 10 figures, 5 tables, 1 algorithm)

Figures (10)

  • Figure 1: Variation of object logit scores during caption generation. We analyze 200 images and compute the average score of each object category across different generation stages. Objects are categorized into three groups: (1) GT-InCap - objects appearing in both the image and the generated caption, (2) GT-OutCap - objects present in the image but missing from the caption, and (3) Hallucinated - objects mentioned in the caption but absent from the image. The y-axis denotes the average logit score of each object group, while the x-axis represents the generation progress. Each caption is divided into twenty stages by token count, where each stage includes all tokens up to that point. The mean and variance are computed for each stage and 95% confidence intervals are reported.
  • Figure 2: Distribution of cosine similarity between attention row-entropy and column-sum across all layers during the generation process. The similarity values exhibit a precise unimodal distribution centered around 0.79 with low variance, indicating a stable and strong positive correlation between attention row-entropy and column-sum.
  • Figure 3: Distribution of cosine similarity between logit ranking and attention row-entropy across object types in 200 images. We compute the cosine similarity between row-entropy and ranking scores for three object categories. Hallucinated objects exhibit the highest similarity, indicating that greater row-entropy correlates with stronger hallucination tendencies, whereas genuine objects (GT-InCap and GT-OutCap) show lower or slightly negative correlations.
  • Figure 4: Precision–recall trade-off on LLaVA-1.5 (CHAIR benchmark). Curves nearer the top-right corner denote superior overall performance. KVSmooth attains a strong precision–recall balance, clearly surpassing all competing methods.
  • Figure 5: Sensitivity analysis of the hyperparameter $\lambda_{\text{ref}}$ for KVSmooth based on LLaVA-1.5 and comparisons of four methods in terms of the $\mathit{CHAIR}_{S}$-$F_1$ trade-off (CHAIR benchmark). It is evident that larger values of $\lambda_{\text{ref}}$ lead to stronger smoothing and improved hallucination mitigation. Moreover, our method consistently maintains the balance between precision and recall, demonstrating stability and reliability across different smoothing strengths.
  • ...and 5 more figures
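
The quantity compared in Figure 2 can be illustrated with a short sketch. It assumes a single square attention matrix whose rows are query distributions, takes the row-entropy of token i from row i and its column-sum from column i, and compares the two resulting vectors by cosine similarity. The function names and this exact pairing are assumptions for illustration; the captions do not specify how heads or layers are aggregated.

```python
import numpy as np

def row_entropy_and_column_sum(attn, eps=1e-12):
    """Per-token row-entropy and column-sum of an (n, n) attention matrix."""
    attn = np.asarray(attn, dtype=np.float64)
    entropy = -(attn * np.log(attn + eps)).sum(axis=1)  # one value per query row
    column_sum = attn.sum(axis=0)                       # attention received per key
    return entropy, column_sum

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Toy causal attention over two tokens.
attn = np.array([[1.0, 0.0],
                 [0.5, 0.5]])
entropy, column_sum = row_entropy_and_column_sum(attn)
similarity = cosine_similarity(entropy, column_sum)
```

In practice the attention matrix would come from a model's attention weights at a given layer; the toy matrix here only shows the shapes involved.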