MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Ludan Ruan; Yiyang Ma; Huan Yang; Huiguo He; Bei Liu; Jianlong Fu; Nicholas Jing Yuan; Qin Jin; Baining Guo

TL;DR

MM-Diffusion introduces the first joint diffusion framework for audio-video generation, employing two coupled denoising networks and a random-shift multi-modal attention mechanism to align synchronized content. The approach combines per-modality forward diffusion with a shared reverse model that leverages both audio and video, enabling unconditional generation and strong zero-shot conditional capabilities. Empirical results on Landscape and AIST++ show significant gains in objective metrics (FVD, FAD, KVD) and clear support from human evaluations, including Turing tests. The work demonstrates the practicality of multi-modal diffusion for high-fidelity, semantically aligned audio-video synthesis and opens avenues for prompt-based and editing-based multimodal generation.

Abstract

We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block that bridges the two subnets, enabling efficient cross-modal alignment so that audio and video reinforce each other's fidelity. Extensive experiments show superior results in unconditional audio-video generation and in zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate a dominant preference for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion.
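
As a rough illustration of the joint process the abstract describes, the sketch below noises an audio signal and a video clip independently using the standard DDPM closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The linear beta schedule, the shared timestep for both modalities, the flat-list data layout, and all function names are assumptions made for illustration; they are not taken from the paper's released code.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per timestep."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [scale * x + sigma * e for x, e in zip(x0, noise)]
    return xt, noise

def joint_forward(audio0, video0, t, abar, rng):
    """Noise audio and video independently, at the same timestep t (assumption)."""
    a_t, eps_a = q_sample(audio0, t, abar, rng)
    v_t, eps_v = q_sample(video0, t, abar, rng)
    # A single joint denoiser would then take (a_t, v_t, t) as input and
    # predict targets for both modalities; that network is omitted here.
    return (a_t, v_t), (eps_a, eps_v)

if __name__ == "__main__":
    rng = random.Random(0)
    abar = alpha_bars(linear_beta_schedule(1000))
    audio0 = [0.5] * 16   # flattened toy audio samples
    video0 = [0.1] * 48   # flattened toy video pixels
    (a_t, v_t), _ = joint_forward(audio0, video0, 500, abar, rng)
    print(len(a_t), len(v_t))
```

The forward pass needs no cross-modal interaction under these assumptions; only the reverse (denoising) model has to see both modalities at once.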

Paper Structure (22 sections, 8 equations, 9 figures, 8 tables)

Introduction
Related Work
Diffusion Probabilistic Models.
Cross-Modality Generation.
Approach
Preliminaries of Vanilla Diffusion
Multi-Modal Diffusion Models
Coupled U-Net for Joint Audio-Video Denoising
Efficient Multi-Modal Blocks.
Random-Shift based Multi-Modal Attention.
Zero-Shot Transfer to Conditional Generation
Experiments
Implementation Details
Datasets
Evaluation Metrics
...and 7 more sections

Figures (9)

Figure 1: Examples of generated video frames ($256 \times 256$) and audio spectrograms from the Landscape [lee2022sound] and AIST++ [li2021learn] datasets. The samples show vivid bonfires burning, sea waves moving, and elegant dancing. The generated audio matches the video appearance (e.g., the periodic rhythm for dancers). The complete high-fidelity videos and audio can be found in the supplementary materials.
Figure 2: An illustration of the multi-modal denoising diffusion process. Forward diffusion (dotted arrow) maps audio and video data to noise independently, while the reverse process (solid arrow) gradually reconstructs multi-modal content with a unified model $\theta_{av}$.
Figure 3: Overview of the proposed MM-Diffusion framework. (a) The coupled U-Net contains coupled audio and video streams (indicated by green and blue blocks, respectively) at each denoising diffusion step. (b) Each MM-Block encodes audio with 1D dilated convolutions and video with 2D+1D spatial-temporal convolutions. (c) An efficient random-shift based multi-modal attention module is further proposed to facilitate specific inter-modality alignment and avoid redundant computations.
Figure 4: More visual examples of generated video frames ($256 \times 256$) with semantically consistent audio (shown as spectrograms). Some cases vividly show wind blowing over snowy mountains, and some show continuous river sound with scenic views.
Figure 5: Several randomly selected examples generated by zero-shot transfer to conditional generation. We adopt the gradient-guided method for better results.
...and 4 more figures
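
The Figure 3 caption states only that the multi-modal attention is random-shift based and avoids redundant computation. One plausible reading, sketched below, is that each audio segment attends to a short window of video frames whose start is offset by a randomly drawn shift, so the cost scales with the window size rather than with the full clip length. The windowing rule, the modulo wrap-around, the single-head dot-product attention, and every name here are assumptions for illustration, not the authors' implementation.

```python
import math
import random

def random_shift_window(num_frames, window, rng):
    """Draw a shift and list, per audio segment, the video frames it attends to."""
    shift = rng.randint(0, num_frames - window)  # inclusive on both ends
    windows = [
        [(s + shift + k) % num_frames for k in range(window)]
        for s in range(num_frames)
    ]
    return shift, windows

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def audio_to_video_attention(audio_segs, video_frames, window, rng):
    """Each audio segment attends only to `window` video frames (hypothetical)."""
    num_frames = len(video_frames)
    dim = len(video_frames[0])
    shift, windows = random_shift_window(num_frames, window, rng)
    out = []
    for query, idxs in zip(audio_segs, windows):
        # Scaled dot-product scores against the frames inside the window only.
        scores = [dot(query, video_frames[i]) / math.sqrt(dim) for i in idxs]
        weights = softmax(scores)
        # Weighted average of the windowed video features.
        out.append([
            sum(w * video_frames[i][j] for w, i in zip(weights, idxs))
            for j in range(dim)
        ])
    return out, shift

if __name__ == "__main__":
    rng = random.Random(0)
    audio = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
    video = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
    attended, shift = audio_to_video_attention(audio, video, window=3, rng=rng)
    print(shift, len(attended), len(attended[0]))
```

With F segments and window size S, this sketch computes F × S attention scores instead of F × F; redrawing the shift on each call lets different neighborhoods be covered over time.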
