AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding
Chaeyoung Jung, Youngjoon Jang, Joon Son Chung
TL;DR
This work tackles hallucinations in audio-visual large language models by introducing Audio-Visual Contrastive Decoding (AVCD), a training-free, trimodal decoding framework. AVCD identifies the dominant modality via attention and uses dominance-aware attentive masking to perturb weaker modalities, then applies a reformulated contrastive decoding objective that accounts for audio, visual, and language interactions. An entropy-guided adaptive decoding mechanism selectively applies AVCD to balance accuracy and efficiency. Empirical results on AVHBench and related multimodal QA datasets show AVCD consistently improves accuracy over baselines for AV-LLMs and generalizes to video-LLMs and image-LLMs, demonstrating robust mitigation of cross-modal hallucinations with practical speedups. The approach offers a scalable, plug-and-play solution to improve reliability of multimodal reasoning in real-world deployments.
Abstract
Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well-suited for AV-LLMs, where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. To support CD in a trimodal setting, we also reformulate the original CD framework to jointly handle audio, visual, and textual inputs. Finally, to improve efficiency, we introduce entropy-guided adaptive decoding, which selectively skips unnecessary decoding steps based on the model's confidence in its predictions. Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. In particular, on the AVHBench dataset, it improves accuracy by 2% for VideoLLaMA2 and 7% for video-SALMONN, demonstrating strong robustness and generalizability. Our code is available at https://github.com/kaistmm/AVCD.
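To make the two generic ingredients mentioned above concrete, the following is a minimal sketch of a single decoding step that (a) contrasts original logits with logits from a perturbed input and (b) skips the contrastive branch when the model is already confident. It is not the AVCD implementation: the two-branch form `(1 + alpha) * original - alpha * perturbed` is a formulation commonly used in earlier vision-language CD work, not the trimodal reformulation proposed here, and the names `decode_step`, `perturbed_logits_fn`, `alpha`, and `entropy_threshold`, as well as their default values, are hypothetical choices for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    clipped = np.clip(probs, 1e-12, 1.0)
    return float(-(clipped * np.log(clipped)).sum())


def contrastive_logits(orig_logits, perturbed_logits, alpha=1.0):
    """Two-branch contrastive adjustment (assumed form, not AVCD's trimodal one)."""
    return (1.0 + alpha) * orig_logits - alpha * perturbed_logits


def decode_step(orig_logits, perturbed_logits_fn, alpha=1.0, entropy_threshold=0.5):
    """Greedy-pick one token, applying contrastive decoding only when uncertain.

    perturbed_logits_fn is a hypothetical callable that runs the extra forward
    pass on the perturbed input; it is invoked only if the entropy gate opens.
    Returns (token_id, used_contrastive).
    """
    if entropy(softmax(orig_logits)) < entropy_threshold:
        # Confident prediction: skip the extra forward pass entirely.
        return int(np.argmax(orig_logits)), False
    adjusted = contrastive_logits(orig_logits, perturbed_logits_fn(), alpha)
    return int(np.argmax(adjusted)), True
```

In this sketch the saving comes from never calling `perturbed_logits_fn` on low-entropy steps, so the cost of the perturbed forward pass is paid only where the original distribution is uncertain. The threshold and `alpha` would need to be tuned per model; the values shown are placeholders.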
