MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani

TL;DR

This paper proposes Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. It incorporates a language-prior debiasing penalty that discourages hallucination-prone text-only responses.

Abstract

Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
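
To make the structure of the objective described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard DPO loss on sequence-level log-probabilities, a KL term that is minimized for invariance to a corrupted irrelevant modality, a KL term that is maximized for sensitivity to a corrupted relevant modality, and a penalty on the log-likelihood of the response given text-only input. All function names, argument names, and the weights `lam_inv`, `lam_sens`, and `lam_lp` are hypothetical; the exact form and weighting of these terms are not given in this abstract.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two categorical distributions given as probability lists.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO: prefer the chosen response relative to a frozen reference model.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

def mod_dpo_style_loss(
    logp_chosen, logp_rejected, ref_chosen, ref_rejected,
    p_clean,                 # output distribution with clean audio and video
    p_irrelevant_corrupted,  # same, with the irrelevant modality corrupted
    p_relevant_corrupted,    # same, with the relevant modality corrupted
    logp_text_only,          # log-prob of the response given text input only
    beta=0.1, lam_inv=1.0, lam_sens=1.0, lam_lp=1.0,  # hypothetical weights
):
    base = dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta)
    # Invariance (assumed form): outputs should not change when the
    # irrelevant modality is corrupted, so this KL is minimized.
    invariance = kl_divergence(p_clean, p_irrelevant_corrupted)
    # Sensitivity (assumed form): outputs should change when the relevant
    # modality is corrupted, so this KL enters with a negative sign.
    sensitivity = kl_divergence(p_clean, p_relevant_corrupted)
    # Language-prior penalty (assumed form): lower the likelihood of
    # producing the response from text alone.
    return base + lam_inv * invariance - lam_sens * sensitivity + lam_lp * logp_text_only
```

In this sketch the sensitivity term is unbounded, so a practical version would likely need to clip or bound it; that choice is also an assumption rather than something stated here.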

Paper Structure

This paper contains 46 sections, 27 equations, 15 figures, and 10 tables.

Figures (15)

  • Figure 1: Comparison of the proposed MoD-DPO with other preference optimization baselines for mitigating cross-modal hallucinations arising from spurious inter-modality correlations and over-reliance on language priors.
  • Figure 2: Comparison of proposed MoD-DPO++ with other preference optimization baselines on AVHBench (Sung-Bin et al., 2025) and CMM (Leng et al., 2025). Average accuracy is reported. ADVH: Audio-driven video hallucination, VDAH: Video-driven audio hallucination.
  • Figure 3: Modality Decoupled Preference Optimization. In addition to the response preference and reference regularization terms in DPO (Rafailov et al., 2023), we include additional KL regularization terms to increase model invariance to irrelevant modalities and model sensitivity to relevant modalities. Additionally, we penalize response generation with only text inputs to remove language priors in the model.
  • Figure 4: Preference Data Generation Pipeline. We disentangle the audiovisual input to obtain separate audio and visual captions or tags (Stage 1), which are then used to generate QA pairs (Stage 2). Finally, we create preference data for modality-specific questions by constructing chosen responses using relevant modality information and rejected responses using irrelevant modality information.
  • Figure 5: Preference Data Statistics. (Left) Number of preference samples generated from different source datasets. (Right) Composition of samples belonging to different tasks.
  • ...and 10 more figures
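
The final step of the preference data pipeline in the Figure 4 caption (chosen responses from the relevant modality, rejected responses from the irrelevant one) can be illustrated with a small sketch. This is an illustration under stated assumptions, not the authors' pipeline: the `PreferenceSample` type, the `build_preference_sample` function, and the use of raw captions as stand-ins for generated responses are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PreferenceSample:
    question: str
    target_modality: str  # "audio" or "visual"
    chosen: str           # grounded in the relevant modality
    rejected: str         # grounded in the irrelevant modality

def build_preference_sample(question, target_modality, audio_caption, visual_caption):
    # Hypothetical helper: the captions stand in for responses that a real
    # pipeline would generate from each modality's disentangled description.
    if target_modality == "audio":
        relevant, irrelevant = audio_caption, visual_caption
    elif target_modality == "visual":
        relevant, irrelevant = visual_caption, audio_caption
    else:
        raise ValueError(f"unknown modality: {target_modality!r}")
    return PreferenceSample(question, target_modality, chosen=relevant, rejected=irrelevant)

sample = build_preference_sample(
    question="What sound can be heard in the clip?",
    target_modality="audio",
    audio_caption="A dog is barking.",
    visual_caption="A cat sits on a sofa.",
)
```

For an audio question, the rejected response here is the one driven by the visual content, which corresponds to the video-driven audio hallucination case named in the Figure 2 caption.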