LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models
Zhihui Guo, Xin Man, Hui Xu, Jie Shao, Zhiguo Jiang, Xianchao Zhang, Heng Tao Shen
TL;DR
LISA addresses object hallucination in Multimodal LLMs with a layer-aware, training-free decoding framework that stabilizes cross-layer attention through spectral modulation and adaptively fuses multi-layer signals via anchor-based routing. By explicitly partitioning the transformer layers into shallow grounding, middle semantic, and deep suppressive zones, LISA suppresses unstable deep-layer activations while preserving grounded information. The approach combines layer-wise spectral suppression, cross-layer fusion, and token-wise soft fusion, so that signals are integrated adaptively during decoding without retraining. Across MSCOCO-based benchmarks and multiple MLLMs, LISA consistently reduces hallucinations and improves visual grounding and recall, indicating practical, generalizable gains for reliable multimodal generation.
Abstract
Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach. LISA leverages the layer-wise functional roles in MLLMs: shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, layer-wise spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully plug-and-play and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in $\text{CHAIR}_\text{I}$ and improves POPE F1 by up to 5.1%, demonstrating strong generalization across models and tasks. Our code is available at https://github.com/zhlisa1010-eng/LISA.
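
To make the second step more concrete, the following is a minimal illustrative sketch of token-wise anchor selection followed by soft logit fusion for a single decoding step. It is not the authors' implementation: the abstract does not specify how the anchor is chosen or how fusion weights are computed, so the lowest-entropy anchor criterion, the KL-based weighting, the temperature `tau`, and the function name `fuse_token_logits` are all assumptions made for illustration.

```python
import math

def softmax(xs, temperature=1.0):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    # Shannon entropy (in nats) of a probability vector.
    return -sum(q * math.log(q) for q in p if q > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), with a small epsilon for numerical safety.
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0.0)

def fuse_token_logits(layer_logits, tau=1.0):
    """Hypothetical token-wise fusion of logits from selected layers.

    layer_logits: one logit vector per selected layer, all for the same token.
    Assumption: the anchor is the layer with the most confident (lowest-entropy)
    distribution, and every layer is weighted by how closely its distribution
    agrees with the anchor.
    """
    probs = [softmax(logits) for logits in layer_logits]
    # Anchor selection (assumed criterion): lowest predictive entropy.
    anchor = min(range(len(probs)), key=lambda i: entropy(probs[i]))
    # Soft routing weights (assumed form): softmax over negative KL to the anchor.
    scores = [-kl_divergence(probs[anchor], p) for p in probs]
    weights = softmax(scores, temperature=tau)
    # Soft logit fusion: weighted sum of the per-layer logits.
    vocab_size = len(layer_logits[0])
    fused = [
        sum(w * logits[v] for w, logits in zip(weights, layer_logits))
        for v in range(vocab_size)
    ]
    return anchor, weights, fused

# Toy example: three selected layers, vocabulary of three tokens.
toy_logits = [[2.0, 0.5, 0.1], [4.0, 0.0, 0.0], [0.3, 0.2, 0.1]]
anchor, weights, fused = fuse_token_logits(toy_logits)
```

In this sketch the anchor always receives the largest weight, because its divergence from itself is zero, while layers that disagree with it are down-weighted rather than discarded. That is one way to read "soft" fusion; a hard routing variant would instead keep only the anchor's logits.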
