Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

Chaeyoung Jung, Youngjoon Jang, Jongmin Choi, Joon Son Chung

TL;DR

This work addresses modality bias in audio-visual large language models by proposing Fork-Merge Decoding (FMD), a training-free inference-time strategy that separates unimodal reasoning in an early fork phase from joint multimodal reasoning in a later merge phase. FMD is model- and fusion-agnostic, compatible with both token-wise and channel-wise fusion, and uses attention-guided fusion to balance the contributions of the two modalities without architectural changes. The authors validate FMD on VideoLLaMA2, video-SALMONN, and Qwen2.5-Omni across AVQA, MUSIC-AVQA, and AVHBench, reporting consistent gains in audio, video, and AV reasoning tasks, with pronounced improvements in AV captioning. The approach provides a computationally efficient, training-free means to mitigate modality bias and make multimodal understanding in real-world AV-LLMs more robust.

Abstract

The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model tends to over-rely on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (fork), and then merges the resulting hidden states for joint reasoning in the remaining layers (merge). This separation allows each modality to be emphasized in the early stages while encouraging balanced contributions during integration. We validate our method on three representative AV-LLMs (VideoLLaMA2, video-SALMONN, and Qwen2.5-Omni) using three benchmark datasets. Experimental results show consistent gains in audio, video, and audio-visual reasoning tasks, highlighting the effectiveness of inference-time interventions for robust and efficient multimodal understanding.
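
The fork and merge phases described above can be illustrated with a toy sketch. This is not the authors' implementation: the decoder layers are stand-in residual projections, masking is done by zeroing a modality's token embeddings, and the merge is a convex combination with a fixed weight `alpha`. All of these, along with the names `make_layer`, `run_layers`, and `fork_merge_decode`, are assumptions made for illustration only.

```python
import numpy as np

def make_layer(rng, dim):
    """Toy stand-in for a decoder layer (hypothetical): residual tanh projection."""
    w = rng.normal(scale=0.1, size=(dim, dim))
    return lambda h: h + np.tanh(h @ w)

def run_layers(layers, h):
    """Pass hidden states through a list of layers in order."""
    for layer in layers:
        h = layer(h)
    return h

def fork_merge_decode(layers, tokens, audio_idx, video_idx, l_fork, alpha):
    # Fork: two passes through the first l_fork layers, each with one
    # modality masked. Zeroing embeddings is an assumed masking scheme;
    # question tokens are left untouched in both passes.
    audio_only = tokens.copy()
    audio_only[video_idx] = 0.0
    video_only = tokens.copy()
    video_only[audio_idx] = 0.0
    h_not_v = run_layers(layers[:l_fork], audio_only)  # video masked
    h_not_a = run_layers(layers[:l_fork], video_only)  # audio masked

    # Merge: combine the two hidden states. A convex combination with a
    # fixed alpha is an assumption; the paper derives alpha from attention.
    h_merged = alpha * h_not_v + (1.0 - alpha) * h_not_a

    # Remaining layers process the merged representation jointly.
    return run_layers(layers[l_fork:], h_merged)

rng = np.random.default_rng(0)
dim = 8
layers = [make_layer(rng, dim) for _ in range(6)]
# Hypothetical sequence layout: 4 video tokens, 3 audio tokens, 3 question tokens.
tokens = rng.normal(size=(10, dim))
video_idx, audio_idx = slice(0, 4), slice(4, 7)

out = fork_merge_decode(layers, tokens, audio_idx, video_idx, l_fork=2, alpha=0.5)
print(out.shape)
```

In this sketch, setting `alpha=1.0` reduces to decoding with the video tokens masked throughout, and `alpha=0.0` to decoding with the audio tokens masked, which makes the role of the merge weight easy to inspect.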

Paper Structure

This paper contains 23 sections, 6 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Attention weight analysis in VideoLLaMA2 on the AVHBench dataset. We analyze 100 samples and examine the attention weights from the last decoder layer, focusing on the final token of the question. Attention is disproportionately allocated to video inputs over audio, revealing a modality bias. Our proposed FMD method reduces this gap by encouraging more balanced contributions from both modalities.
  • Figure 2: Overview of the Fork-Merge Decoding pipeline. The AV-LLM takes video frames, an audio waveform, and a question prompt as input. In the fork phase, FMD masks one modality while preserving the question, enabling independent reasoning. After $L_{\text{fork}}$ decoder layers, the merge phase combines ${\bm{h}}_{\text{fork}}^{\neg v}$ and ${\bm{h}}_{\text{fork}}^{\neg a}$ with an attention-derived weight $\alpha$, and the merged representation is processed by the remaining layers to generate answers with balanced multimodal understanding.
  • Figure 3: Qualitative results with VideoLLaMA2 on AVHBench and AVQA. Vanilla decoding often relies on a single modality, resulting in incomplete or inconsistent outputs, whereas FMD effectively integrates both audio and visual information to produce more accurate and coherent results.
  • Figure 4: Layer-wise hidden state similarity in VideoLLaMA2. $L_{\text{fork}}$ is chosen from the early stage.
  • Figure 5: Layer-wise attention weight comparison on VideoLLaMA2 using 600 samples from the AVHBench dataset. We analyze the attention weights from the final token in the last decoder layer, focusing on the distribution across video and audio segments. Deeper merging within the network results in reduced attention to visual tokens and heightened attention to audio tokens.
  • ...and 9 more figures
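
The attention analyses in Figures 1 and 5 compare how much attention the final question token places on video versus audio tokens. A minimal sketch of that kind of measurement follows, assuming a single attention row (already averaged over heads) and known token positions for each modality. The function name `modality_attention_share` and the numbers in `attn_row` are hypothetical and are not results from the paper.

```python
def modality_attention_share(attn_row, video_idx, audio_idx):
    """Split the attention mass on video and audio tokens into two shares.

    The shares are normalized over the two modalities only, so attention
    on question/text tokens is ignored (an assumed convention).
    """
    video_mass = sum(attn_row[video_idx])
    audio_mass = sum(attn_row[audio_idx])
    total = video_mass + audio_mass
    return video_mass / total, audio_mass / total

# Hypothetical attention row for the final question token:
# 4 video tokens, 3 audio tokens, 3 text tokens (illustrative values only).
attn_row = [0.15, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05, 0.10, 0.10, 0.15]
video_share, audio_share = modality_attention_share(
    attn_row, slice(0, 4), slice(4, 7)
)
print(f"video: {video_share:.2f}, audio: {audio_share:.2f}")
```

A large gap between the two shares is the kind of imbalance the captions describe as modality bias; in practice the row would be read from the model's last decoder layer rather than written by hand.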