AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding

Chaeyoung Jung, Youngjoon Jang, Joon Son Chung

TL;DR

This work tackles hallucinations in audio-visual large language models by introducing Audio-Visual Contrastive Decoding (AVCD), a training-free, trimodal decoding framework. AVCD identifies the dominant modality via attention and uses dominance-aware attentive masking to perturb weaker modalities, then applies a reformulated contrastive decoding objective that accounts for audio, visual, and language interactions. An entropy-guided adaptive decoding mechanism selectively applies AVCD to balance accuracy and efficiency. Empirical results on AVHBench and related multimodal QA datasets show AVCD consistently improves accuracy over baselines for AV-LLMs and generalizes to video-LLMs and image-LLMs, demonstrating robust mitigation of cross-modal hallucinations with practical speedups. The approach offers a scalable, plug-and-play solution to improve reliability of multimodal reasoning in real-world deployments.
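The entropy-guided gating described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the confidence signal is the Shannon entropy (in nats) of the softmax over next-token logits, and that contrastive decoding runs only when that entropy exceeds a threshold `tau`. The function names and the unnormalized entropy are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_contrastive_step(logits, tau):
    """Hypothetical gate: True when the prediction is uncertain enough
    (entropy above tau) that the extra contrastive passes are worth running."""
    return entropy(softmax(logits)) > tau


# A peaked distribution is confident, so the contrastive step would be skipped;
# a flat distribution is uncertain, so it would be applied.
confident = [10.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0, 0.0]
skip_confident = not needs_contrastive_step(confident, tau=0.8)
apply_uncertain = needs_contrastive_step(uncertain, tau=0.8)
```

Under this gate, a larger `tau` skips more steps and a smaller `tau` applies contrastive decoding more often, which matches the speed versus accuracy trade-off the paper attributes to the threshold.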

Abstract

Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well-suited for AV-LLMs, where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. To support CD in a trimodal setting, we also reformulate the original CD framework to jointly handle audio, visual, and textual inputs. Finally, to improve efficiency, we introduce entropy-guided adaptive decoding, which selectively skips unnecessary decoding steps based on the model's confidence in its predictions. Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. In particular, on the AVHBench dataset, it improves accuracy by 2% for VideoLLaMA2 and 7% for video-SALMONN, demonstrating strong robustness and generalizability. Our code is available at https://github.com/kaistmm/AVCD.
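For readers unfamiliar with contrastive decoding, the sketch below shows the two-branch form commonly used in VLM work: the logits from the original input are amplified and the logits from a perturbed input are subtracted. It is not the trimodal reformulation proposed in the paper, which is not reproduced in this excerpt. The weight `alpha`, the function names, and the toy logits are assumptions chosen for illustration.

```python
def contrastive_logits(original, perturbed, alpha=1.0):
    """Two-branch contrastive decoding: (1 + alpha) * original - alpha * perturbed.

    Tokens that stay likely even when the input is perturbed (a sign that the
    model is relying on priors rather than the input) are pushed down.
    """
    return [(1.0 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


# Toy example: token 0 is slightly preferred with the full input, but it is
# preferred even more strongly once the input is perturbed, so it is likely
# driven by a prior. Contrasting the two shifts the choice to token 1.
original = [2.0, 1.8, 0.1]
perturbed = [2.5, 0.5, 0.1]
before = argmax(original)                                 # 0
after = argmax(contrastive_logits(original, perturbed))   # 1
```

With `alpha=0` the function returns the original logits unchanged, so `alpha` controls how strongly the perturbed branch is penalized.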

Paper Structure

This paper contains 28 sections, 18 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: Hallucination mitigation with Audio-Visual Contrastive Decoding (AVCD). Inaccurate visual and audio-visual information is highlighted in red and blue, respectively, and corrected during inference via AVCD, enabling the production of precise details such as 'a shirt with a bird on it'.
  • Figure 2: Overall AVCD pipeline. Given an audio-visual input and a question, the model generates predicted logits along with a stacked modality dominance score $D_M$, computed by summing the attention values of the final query token across modalities from the attention map $A_{Q_K}$ (Eq. \ref{eq;dominance}). To improve efficiency, CD is skipped when the model's prediction has high confidence (i.e., low entropy). Otherwise, once a dominant modality is identified (e.g., language > video > audio), AVCD applies all possible masking combinations across the less dominant modalities (Audio, Video, and Audio-Visual) for CD. An attentive masking strategy is used to perturb the less dominant modalities, and CD is performed using Eq. \ref{eq8}. This process promotes balanced multimodal reasoning by enhancing the influence of weaker modalities (e.g., audio and video) while maintaining efficient inference.
  • Figure 3: Analysis of the attentive masking strategy. By masking a specific modality, its influence is reduced, allowing the model to focus on the remaining modalities when generating outputs.
  • Figure 4: Qualitative results on AV-LLM and video-LLM using VideoLLaMA2 \cite{cheng2024videollama}. AVCD effectively leverages all modalities by mitigating the issue of certain modalities being ignored.
  • Figure 5: Comparison across entropy thresholds ($\tau$). $\tau$ controls the trade-off between inference speed and accuracy. At $\tau = 0.8$, AVCD achieves faster inference than VCD while outperforming Base decoding in accuracy.
  • ...and 6 more figures
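
The dominance step in the Figure 2 caption can be sketched as follows. The caption states that $D_M$ sums the final query token's attention over each modality and that every masking combination of the less dominant modalities is then contrasted. The code assumes the attention row has already been aggregated over heads and layers and that each modality occupies a contiguous token span; both are assumptions, as are all names. The attentive masking itself is not specified in this excerpt and is not implemented here.

```python
from itertools import combinations


def dominance_scores(last_query_attention, modality_spans):
    """Sum the final query token's attention over each modality's key positions.

    modality_spans maps a modality name to a half-open (start, end) index range.
    """
    return {
        name: sum(last_query_attention[start:end])
        for name, (start, end) in modality_spans.items()
    }


def masking_combinations(scores):
    """Return the dominant modality and every non-empty subset of the others."""
    dominant = max(scores, key=scores.get)
    weaker = sorted(name for name in scores if name != dominant)
    combos = []
    for size in range(1, len(weaker) + 1):
        combos.extend(combinations(weaker, size))
    return dominant, combos


# Toy attention row over 8 key tokens: 2 audio, 3 video, 3 language.
attention = [0.05, 0.05, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2]
spans = {"audio": (0, 2), "video": (2, 5), "language": (5, 8)}
scores = dominance_scores(attention, spans)
dominant, combos = masking_combinations(scores)
```

In this toy case language receives the most attention, so the combinations to mask are audio only, video only, and audio plus video, which correspond to the Audio, Video, and Audio-Visual branches named in the caption.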