Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Bingqi Ma; Linlong Lang; Ming Zhang; Dailan He; Xingtong Ge; Yi Zhang; Guanglu Song; Yu Liu

Abstract

Joint audio-video generation built on the dual-stream transformer architecture has become the dominant paradigm in current research. By combining pre-trained video and audio diffusion models with a cross-modal interaction attention module, such methods can generate high-quality, temporally synchronized audio-video content with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and analyze its limitations: variations in the model manifold caused by the gating mechanism that controls cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, inconsistencies in multi-modal classifier-free guidance (CFG) between training and inference, and conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) enhances the temporal alignment between the audio and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information and route dynamically according to the training task, further improving the model's convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts between conditions. In comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.
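For readers unfamiliar with classifier-free guidance, the following is a minimal sketch of the standard CFG combination rule that the abstract builds on. It is not the paper's UCG procedure, whose details are not given here; the function name `cfg_combine`, the plain-list representation of model outputs, and the example values are all illustrative assumptions.

```python
from typing import List, Sequence


def cfg_combine(uncond: Sequence[float], cond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance (hypothetical helper, not from the paper).

    Extrapolates from the unconditional prediction toward the conditional one:
        guided = uncond + scale * (cond - uncond)
    scale = 0 returns the unconditional prediction, scale = 1 the conditional one.
    """
    if len(uncond) != len(cond):
        raise ValueError("uncond and cond must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Toy predictions standing in for denoiser outputs (assumed values).
    uncond_pred = [0.0, 1.0]
    cond_pred = [1.0, 3.0]
    print(cfg_combine(uncond_pred, cond_pred, scale=2.0))
```

With several conditions (for example text plus the other modality), each condition needs its own unconditional counterpart. That is where a stable unconditional anchor, such as the one the abstract attributes to LCT, would plausibly enter.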

Paper Structure (23 sections, 7 equations, 6 figures, 2 tables)

Introduction
Related work
Video Generation Models
Joint Audio-Video Generation
Revisiting the Dual-Stream Transformer Pipeline
The Gating Mechanism
Cross-Modal Attention
Multi-Modal CFG
Cross-Modal Context Learning
Temporally Aligned RoPE and Partitioning
Cross-Modal Context Attention
Learnable Context Tokens
Dynamic Context Routing
Unconditional Context Guidance
Multi-Task Training
...and 8 more sections

Figures (6)

Figure 1: We demonstrate several capabilities of CCL, including multilingual human speech generation, environmental sound synthesis, music generation, background speech generation, storyboard-style scene transitions, and dialogue generation, as well as applicability to real-world scenarios such as beauty tutorial production.
Figure 2: The gating mechanism alters the optimization objective during training, which affects training efficiency.
Figure 3: Visualization of cross-modal attention in Ovi. We observe that background regions in the audio attend strongly to random regions in the video, while background regions in the video assign high attention to the final audio token. This suggests that the cross-modal attention is semantically misaligned, introducing positional bias and consequently degrading model performance.
Figure 4: The pipeline of our proposed Cross-Modal Context Learning. CCL follows the conventional dual-stream transformer architecture and adds several newly designed modules, enabling efficient and effective joint audio-video generation with high consistency. The figure illustrates the implementation details of the proposed modules. For Dynamic Context Routing, each color indicates that the path of the same color on the left is activated.
Figure 5: Training loss with the gating mechanism compared with CCL. We plot only the loss for the joint audio-video generation task, smoothed with an exponential moving average (EMA). Because the loss is unstable during the early stages of training, the plot begins at iteration 100.
...and 1 more figure
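The EMA smoothing mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration only: the function name `ema_smooth`, the decay value, and the choice to compute the average over all iterations before dropping the first `skip` points are assumptions, since the caption does not specify them.

```python
from typing import List, Sequence


def ema_smooth(values: Sequence[float], decay: float = 0.99, skip: int = 0) -> List[float]:
    """Exponential moving average of a loss curve (hypothetical helper).

    avg_0 = values[0]
    avg_t = decay * avg_{t-1} + (1 - decay) * values[t]
    The first `skip` smoothed points are dropped from the result, which mimics
    starting a plot after an unstable warm-up phase (assumed behavior).
    """
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    smoothed: List[float] = []
    avg = None
    for v in values:
        avg = v if avg is None else decay * avg + (1.0 - decay) * v
        smoothed.append(avg)
    return smoothed[skip:]


if __name__ == "__main__":
    # Toy loss values (assumed); a real curve would have one entry per iteration.
    losses = [2.0, 4.0, 8.0]
    print(ema_smooth(losses, decay=0.5, skip=1))
```

For a curve like the one described in the caption, one would call `ema_smooth(losses, skip=100)` on the per-iteration losses of the joint audio-video task.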
