3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation

Yaoru Li; Heyu Si; Federico Landi; Pilar Oplustil Gallegos; Ioannis Koutsoumpas; O. Ricardo Cortez Vazquez; Ruiju Fu; Qi Guo; Xin Jin; Shunyu Liu; Mingli Song

TL;DR

3MDiT is a unified tri-modal diffusion transformer that models text, audio, and video as jointly evolving streams for synchronized, text-driven audio-video generation. The architecture combines an isomorphic audio branch that mirrors the T2V backbone, tri-modal omni-blocks for explicit cross-modal fusion, and an optional dynamic text conditioning mechanism that updates the text representation as the audio and video streams co-evolve. It supports either training from scratch or plug-in adaptation of a pretrained T2V backbone, and reports improved audio-video synchronization on landscape and more diverse datasets, backed by ablations. The design offers controllable fusion strength and temporal alignment while preserving the priors of the pretrained backbone.

Abstract

Text-to-video (T2V) diffusion models have recently achieved impressive visual quality, yet most systems still generate silent clips and treat audio as a secondary concern. Existing audio-video generation pipelines typically decompose the task into cascaded stages, which accumulate errors across modalities and are trained under separate objectives. Recent joint audio-video generators alleviate this issue but often rely on dual-tower architectures with ad-hoc cross-modal bridges and static, single-shot text conditioning, which makes it difficult both to reuse T2V backbones and to reason about how audio, video, and language interact over time. To address these challenges, we propose 3MDiT, a unified tri-modal diffusion transformer for text-driven synchronized audio-video generation. Our framework models video, audio, and text as jointly evolving streams: an isomorphic audio branch mirrors a T2V backbone, tri-modal omni-blocks perform feature-level fusion across the three modalities, and an optional dynamic text conditioning mechanism updates the text representation as audio and video evidence co-evolve. The design supports two regimes: training from scratch on audio-video data, and orthogonally adapting a pretrained T2V model without modifying its backbone. Experiments show that our approach generates high-quality videos and realistic audio while consistently improving audio-video synchronization and tri-modal alignment across a range of quantitative metrics.
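
To make the idea of feature-level fusion across three token streams concrete, the following is a minimal NumPy sketch of one possible tri-modal fusion step: each modality keeps its own projections, all tokens attend jointly over the concatenated sequence, and the fused update is added back residually. This is not the authors' implementation. The names `init_params`, `omni_block`, and `fusion_scale`, the single-head attention, and the shared feature width are all assumptions made for illustration; the abstract does not specify the internals of the omni-blocks.

```python
import numpy as np

MODALITIES = ("video", "audio", "text")

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(dim, seed=0):
    # Hypothetical per-modality projections (query, key, value, output).
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)
    return {
        name: tuple(rng.standard_normal((dim, dim)) * scale for _ in range(4))
        for name in MODALITIES
    }

def omni_block(video, audio, text, params, fusion_scale=1.0):
    """Joint single-head attention over concatenated video/audio/text tokens.

    Each input has shape (num_tokens, dim). `fusion_scale` is an assumed
    knob: 0.0 disables the fused update and returns the inputs unchanged.
    """
    streams = {"video": video, "audio": audio, "text": text}
    # Project each stream with its own weights, then concatenate along tokens.
    q = np.concatenate([streams[m] @ params[m][0] for m in MODALITIES])
    k = np.concatenate([streams[m] @ params[m][1] for m in MODALITIES])
    v = np.concatenate([streams[m] @ params[m][2] for m in MODALITIES])
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    fused = attn @ v  # every token mixes information from all three modalities
    out, start = {}, 0
    for m in MODALITIES:
        n = streams[m].shape[0]
        # Residual update through the modality-specific output projection.
        out[m] = streams[m] + fusion_scale * (fused[start:start + n] @ params[m][3])
        start += n
    return out
```

Under these assumptions, calling `omni_block` with `fusion_scale=0.0` leaves every stream untouched, which loosely mirrors the stated goal of adapting a pretrained T2V model without disturbing its backbone, while larger values let audio and text evidence influence the video tokens more strongly.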
