LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models

Zhihui Guo, Xin Man, Hui Xu, Jie Shao, Zhiguo Jiang, Xianchao Zhang, Heng Tao Shen

TL;DR

LISA addresses object hallucination in Multimodal LLMs by introducing a layer-aware, training-free decoding framework that stabilizes cross-layer attention through spectral modulation and adaptively fuses multi-layer signals via anchor-based routing. By explicitly partitioning transformers into shallow grounding, middle semantic, and deep suppressive zones, LISA suppresses unstable deep-layer activations while preserving grounded information. The approach combines layer-wise spectral suppression, cross-layer fusion, and token-wise soft fusion, enabling adaptive integration during decoding without retraining. Across MSCOCO-based benchmarks and multiple MLLMs, LISA consistently reduces hallucinations and improves visual grounding and recall, demonstrating practical, generalizable gains for reliable multimodal generation.

Abstract

Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach. LISA leverages the layer-wise functional roles in MLLMs: shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, layer-wise spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully plug-and-play and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in $\text{CHAIR}_\text{I}$ and improves POPE F1 by up to 5.1%, demonstrating strong generalization across models and tasks. Our code is available at https://github.com/zhlisa1010-eng/LISA.
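The second step in the abstract (anchor selection followed by soft logit fusion) can be illustrated with a toy sketch. This is not the authors' implementation: the per-layer `stability` scores, the `num_anchors` and `temperature` parameters, and the softmax weighting rule are all assumptions introduced here for illustration. The sketch only shows the general idea of picking a few anchor layers per token and blending their logits with soft weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(layer_logits, stability, num_anchors=2, temperature=1.0):
    """Blend logits from the most stable layers for one decoding step.

    layer_logits: dict mapping layer index -> list of vocabulary logits.
    stability:    dict mapping layer index -> float, higher means more stable
                  (a hypothetical score; the source does not define it here).
    """
    # Anchor selection: keep the num_anchors layers with the highest score.
    anchors = sorted(layer_logits, key=lambda l: stability[l], reverse=True)
    anchors = anchors[:num_anchors]
    # Soft fusion: weights come from a softmax over the anchor scores.
    weights = softmax([stability[l] / temperature for l in anchors])
    vocab_size = len(layer_logits[anchors[0]])
    fused = [
        sum(w * layer_logits[l][v] for w, l in zip(weights, anchors))
        for v in range(vocab_size)
    ]
    return anchors, weights, fused


# Toy example with a 3-token vocabulary. The final layer (31) spikes on
# token 1, while two earlier layers agree on token 0.
layer_logits = {
    16: [2.0, 0.5, 0.1],
    24: [1.8, 0.7, 0.2],
    31: [0.2, 3.5, 0.1],
}
stability = {16: 1.0, 24: 1.0, 31: -2.0}  # made-up scores

anchors, weights, fused = fuse_logits(layer_logits, stability)
best_token = fused.index(max(fused))
print(anchors, weights, best_token)
```

In this toy setting the unstable final layer is excluded from the anchors, so the fused logits favor the token that the two earlier layers agree on rather than the final-layer spike.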

Paper Structure

This paper contains 27 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: An illustration of object hallucination in multimodal generation. The upper response, generated by the LLaVA model [DBLP:conf/cvpr/LiuLLL24] using greedy decoding, hallucinates the objects "dining" and "table" (highlighted in red), which are absent from the image. This reveals a typical failure of visual grounding. In contrast, the lower response, produced by LISA, provides a faithful and grounded description, free from object hallucination.
  • Figure 2: Layer-wise token probabilities. The token probability is calculated by applying the Softmax function to the raw logits of the token at each layer. Subfigures (a1) to (a3) show hallucinated tokens; subfigures (b1) to (b3) show non-hallucinated tokens. Y-axis: token probability; X-axis: layer index (0 to 31).
  • Figure 3: Layer-wise spectral energy during token prediction. A representative example from a Multimodal Large Language Model (MLLM) shows that query spectral energy varies across layers for a single token generation step, forming three zones: Preservation (blue) retains input signals; Interaction (yellow) builds semantic fusion; Suppression (red) shows spikes linked to hallucination. This pattern motivates layer-wise decoding strategies.
  • Figure 4: Layer-wise heatmaps of hallucinated tokens. Left: Greedy decoding shows sharp final-layer spikes (e.g., "ining", "table"). Right: LISA suppresses unstable activations and distributes confidence across layers.
  • Figure 5: Overview of LISA. LISA stabilizes multimodal generation by modulating the layer-wise spectral energy of transformer attention. It partitions layers into three spectral zones (preservation, interaction, and suppression) that reflect the layer-wise spectral energy progression of queries and keys. Layer-wise spectral suppression dynamically scales attention to dampen unstable deep-layer spikes while preserving shallow and middle-layer semantics. Cross-layer token fusion aggregates stable representations across selected anchor layers, weighted by spectral stability. Finally, token-wise anchor selection and soft logit fusion adaptively integrate multi-layer signals during decoding, ensuring each token draws from the most stable layers. Together, LISA combines spectral modulation, anchor-based fusion, and token-wise routing to mitigate hallucinations while retaining layer-wise information.
  • ...and 3 more figures
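
The layer-wise token probability described in the Figure 2 caption (Softmax applied to a token's raw logits at each layer) can be sketched as follows. The per-layer logits are taken as given; how they are obtained from the model (for example, by projecting intermediate hidden states through the output head) is an assumption and is not specified in the caption. The function names and toy values are hypothetical.

```python
import math


def token_probability(logits, token_id):
    """Softmax probability of one token given raw vocabulary logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[token_id] / sum(exps)


def layerwise_probabilities(per_layer_logits, token_id):
    """Probability of token_id at every layer, in layer order."""
    return [token_probability(logits, token_id) for logits in per_layer_logits]


# Toy example: 3 layers, 3-token vocabulary. The token of interest (id 1)
# is uniform in the first two layers and spikes in the last one.
per_layer_logits = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, math.log(4.0), 0.0],
]
probs = layerwise_probabilities(per_layer_logits, token_id=1)
print(probs)
```

Plotting `probs` against the layer index gives a curve of the kind shown in Figure 2; a sharp rise only in the last layers is the pattern the caption associates with hallucinated tokens.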