MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

TL;DR

MMAudio introduces a multimodal joint training framework for high-quality video-to-audio synthesis conditioned on video and text. It jointly trains on audio-visual and audio-text data within a single transformer-based architecture, augmented by a conditional synchronization module that uses frame-level visual cues to improve audio-visual synchrony. The approach delivers state-of-the-art qualitative and quantitative results among public video-to-audio models, while remaining efficient and scalable, and it also demonstrates competitive text-to-audio performance without task-specific fine-tuning. The work shows that cross-modal data and joint semantic spaces can significantly boost audio quality, semantic alignment, and synchronization, establishing a foundation for broad multimodal audio-visual generation.

Abstract

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
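The flow matching objective named in the abstract can be illustrated with a minimal NumPy sketch. It assumes the commonly used linear interpolation path between a noise sample and a data latent; the function names, the `velocity_fn` signature, and the Euler step count are hypothetical and are not taken from the MMAudio code base.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Sample (t, x_t, target velocity) for a batch of data latents x1.

    Assumes a linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose velocity along the path is the constant x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)                 # noise endpoint (t = 0)
    t_shape = (x1.shape[0],) + (1,) * (x1.ndim - 1)    # one t per batch item
    t = rng.uniform(size=t_shape)
    xt = (1.0 - t) * x0 + t * x1                       # point on the path
    u = x1 - x0                                        # regression target
    return t, xt, u

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between predicted and target velocity."""
    t, xt, u = flow_matching_pair(x1, rng)
    v = velocity_fn(t, xt, cond)                       # hypothetical network call
    return float(np.mean((v - u) ** 2))

def euler_sample(velocity_fn, x0, cond, num_steps=25):
    """Integrate dx/dt = v(t, x, cond) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps                               # step count is an assumption
    t_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    for i in range(num_steps):
        t = np.full(t_shape, i * dt)
        x = x + dt * velocity_fn(t, x, cond)
    return x
```

In this sketch, `cond` stands in for whatever video and text features the network consumes. Training would minimize `flow_matching_loss` over the network parameters, and generation would call `euler_sample` starting from Gaussian noise.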

Paper Structure

This paper contains 64 sections, 12 equations, 11 figures, 14 tables.

Figures (11)

  • Figure 1: In addition to training on audio-visual-(text) datasets, we perform multimodal joint training with high-quality, abundant audio-text data which enables effective data scaling. At inference, MMAudio generates conditions-aligned audio with video and/or text guidance.
  • Figure 2: Overview of the MMAudio flow-prediction network. Video conditions, text conditions, and audio latents jointly interact in the multimodal transformer network. A synchronization model (see the conditional synchronization section) injects frame-aligned synchronization features for precise audio-visual synchrony.
  • Figure 3: We visualize the spectrograms of generated audio (by prior works and our method) and the ground-truth. Note our method generates the audio effects most closely aligned to the ground-truth, while other methods often generate sounds not explained by the visual input and not present in the ground-truth.
  • Figure A1: Sorted MMAudio and Movie Gen Audio performance scores in Movie Gen Audio Bench.
  • Figure A2: Examples of videos in Movie Gen Audio Bench that are well/not well covered by our training data. Left: with a familiar concept in our training data (516 swimming videos in the VGGSound training set), MMAudio achieves a higher IB-score. Right: with an unfamiliar concept (there are no videos about mashed potatoes in VGGSound (Chen et al., 2020), according to the provided labels), MMAudio attains a significantly lower IB-score.
  • ...and 6 more figures