3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation

Yaoru Li, Heyu Si, Federico Landi, Pilar Oplustil Gallegos, Ioannis Koutsoumpas, O. Ricardo Cortez Vazquez, Ruiju Fu, Qi Guo, Xin Jin, Shunyu Liu, Mingli Song

TL;DR

3MDiT is a unified tri-modal diffusion transformer that models text, audio, and video as jointly evolving streams for synchronized, text-driven audio-video generation. The architecture combines an isomorphic audio branch that mirrors the video backbone, tri-modal omni-blocks for explicit cross-modal fusion, and an optional dynamic text conditioning mechanism that updates the text representation as the audio and video streams co-evolve. It supports either training from scratch or plug-in adaptation of a pretrained T2V backbone, and reports improved audio-video synchronization on the Landscape and AVSync15 datasets, supported by ablations over the number of audio blocks and omni-blocks. The design allows fusion strength to be controlled while preserving the priors of the pretrained backbone.

Abstract

Text-to-video (T2V) diffusion models have recently achieved impressive visual quality, yet most systems still generate silent clips and treat audio as a secondary concern. Existing audio-video generation pipelines typically decompose the task into cascaded stages, which accumulate errors across modalities and are trained under separate objectives. Recent joint audio-video generators alleviate this issue but often rely on dual-tower architectures with ad-hoc cross-modal bridges and static, single-shot text conditioning, making it difficult to both reuse T2V backbones and to reason about how audio, video and language interact over time. To address these challenges, we propose 3MDiT, a unified tri-modal diffusion transformer for text-driven synchronized audio-video generation. Our framework models video, audio and text as jointly evolving streams: an isomorphic audio branch mirrors a T2V backbone, tri-modal omni-blocks perform feature-level fusion across the three modalities, and an optional dynamic text conditioning mechanism updates the text representation as audio and video evidence co-evolve. The design supports two regimes: training from scratch on audio-video data, and orthogonally adapting a pretrained T2V model without modifying its backbone. Experiments show that our approach generates high-quality videos and realistic audio while consistently improving audio-video synchronization and tri-modal alignment across a range of quantitative metrics.

Paper Structure

This paper contains 72 sections, 57 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (a) Overall architecture: isomorphic Video-DiT and Audio-DiT branches with pluggable omni-blocks; (b) Omni-blocks: concatenate $(x,y,a)$, apply 3D/1D RoPE to video/audio (text unrotated), perform joint attention, then split by modality and update with gated MLPs; (c) Dynamic text conditioning: between Video-DiT and Audio-DiT, a lightweight cross-attention dynamically refines the shared text representation to improve cross-modal synchrony.
  • Figure 2: Metrics progress during training of the Wan-adapted model on the Landscape dataset. The horizontal axis indicates dataset repeats (number of full passes over the training set). Audio quality improves rapidly while visual fidelity remains stable, confirming compatibility between the frozen video backbone and learned audio pathways.
  • Figure 3: Effect of the number of audio blocks and omni-blocks on joint audio-video synchronization. ‘Audio8-Omni8’ denotes a model variant with 8 audio blocks and 8 omni-blocks; the other variants are named analogously. We report FVD/FAD (lower is better) and AVH/Javis scores (higher is better) on the Landscape dataset, using the Wan 2.1 T2V-1.3B backbone as the base model at 512$\times$288 and 10 fps with CFG=2.
  • Figure 4: An example generated using SD3 + A + D (see Table \ref{tab:avsync15}). Video frames (top) and audio (waveform and spectrogram) for the prompt: "In a dimly lit, misty night scene, a large toad with mottled brown and white skin sits motionless on a patch of wet, sandy ground. (...) Audio: The deep, resonant croaking of a large toad echoes through the misty night air (...)." The audio and video are aligned in time: the sound of the croaking matches the movement of the toad's throat. The full caption specifies that while the throat moves, the mouth remains closed.
  • Figure 5: Case studies. The top-left example is generated by the Wan-adapted model trained and evaluated on the Landscape dataset, while the remaining three examples are produced by the SD3-adapted models trained and evaluated on AVSync15. Each panel shows six uniformly sampled video frames, the corresponding waveform, and a 0-10 kHz log-mel spectrogram. The four cases illustrate distinct acoustic events: rain, gunshots, rooster crowing, and cello playing (ordered left-to-right, top-to-bottom).
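
The omni-block and dynamic text conditioning steps described in the Figure 1 caption can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation; it makes several simplifying assumptions. The names `rope_1d`, `attention`, `omni_block`, and `refine_text` are hypothetical. The sketch uses a single attention head with identity Q/K/V projections, applies 1D RoPE to both video and audio (the caption uses 3D RoPE for video), replaces the gated MLPs with a scalar gated residual, and reads $(x,y,a)$ as video, text, and audio tokens.

```python
import math

def rope_1d(tokens, base=10000.0):
    """Rotate consecutive feature pairs of each token by an angle set by its index."""
    dim = len(tokens[0])  # assumed to be even
    out = []
    for pos, vec in enumerate(tokens):
        rotated = []
        for i in range(0, dim, 2):
            theta = pos / (base ** (i / dim))
            c, s = math.cos(theta), math.sin(theta)
            rotated += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
        out.append(rotated)
    return out

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of feature vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, values))
                    for j in range(len(values[0]))])
    return out

def omni_block(video, text, audio, gate=0.5):
    """Concatenate the three streams, attend jointly, then split by modality."""
    # Positions are rotated for video and audio only; text stays unrotated.
    qk = rope_1d(video) + text + rope_1d(audio)
    values = video + text + audio
    attended = attention(qk, qk, values)
    # Assumption: a scalar gated residual stands in for the gated MLP update.
    mixed = [[v + gate * a for v, a in zip(val, att)]
             for val, att in zip(values, attended)]
    n_v, n_t = len(video), len(text)
    return mixed[:n_v], mixed[n_v:n_v + n_t], mixed[n_v + n_t:]

def refine_text(text, video, audio, step=0.5):
    """Text tokens cross-attend to audio-video features and are updated residually."""
    context = video + audio
    attended = attention(text, context, context)
    return [[t + step * a for t, a in zip(tok, att)]
            for tok, att in zip(text, attended)]

video = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5]]
text = [[0.2, 0.1, 0.0, 0.3]]
audio = [[0.3, 0.3, 0.1, 0.0], [0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 1.0, 0.0]]
new_video, new_text, new_audio = omni_block(video, text, audio)
refined_text = refine_text(new_text, new_video, new_audio)
```

Setting `gate=0.0` or `step=0.0` returns the inputs unchanged. This is one simple way to picture how the strength of the fusion could be controlled without altering a backbone's own features.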