ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation
Jiahui Sun, Weining Wang, Mingzhen Sun, Yirong Yang, Xinxin Zhu, Jing Liu
TL;DR
ProAV-DiT introduces a unified framework for efficient, synchronized audio-video generation, also called Sounding Video Generation (SVG). It transforms audio into video-like representations and encodes both modalities into a shared latent space via a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA). A spatio-temporal diffusion Transformer (STDiT) then operates in this latent space, stacking the modality latents into a compact sequence and applying spatiotemporal attention to jointly model audiovisual semantics. MDSA uses orthogonal decomposition along the temporal, height, and width axes, together with multi-scale temporal self-attention and group cross-modal attention, to improve cross-modal alignment and temporal coherence; a Bi-Block CrossAttn module adds localized fusion. Empirical results on Landscape, AIST++, and AudioSet show state-of-the-art generation quality and substantial efficiency gains, including faster sampling and lower memory usage, along with strong audio-video alignment metrics and favorable human judgments. These findings suggest that ProAV-DiT is a scalable, generalizable approach to high-fidelity, synchronized SVG in open-domain settings.
Abstract
Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of multi-scale temporal self-attention and group cross-modal attention. Furthermore, we stack the 2D latents from MDSA into a unified 3D latent space, which is processed by a spatio-temporal diffusion Transformer. This design efficiently models spatiotemporal dependencies, enabling the generation of high-fidelity synchronized audio-video content while reducing computational overhead. Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.
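The projection-and-stacking idea described above can be illustrated with a minimal sketch. It is not the authors' MDSA: the learned projections are replaced here by plain averaging along each axis, the audio is assumed to be already preprocessed into a video-like T x H x W volume, and T = H = W is assumed so that all planes share one shape and can be stacked directly. The helper names `make_volume`, `project_planes`, and `shape` are hypothetical.

```python
def make_volume(T, H, W, offset=0.0):
    """Build a toy T x H x W volume as nested lists (stand-in for encoder features)."""
    return [[[offset + t + h + w for w in range(W)] for h in range(H)] for t in range(T)]


def project_planes(volume):
    """Collapse a T x H x W volume onto three axis-aligned 2D planes.

    Assumption: mean pooling replaces the learned projections of the paper.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    # Average over time -> spatial plane of shape H x W.
    hw = [[sum(volume[t][h][w] for t in range(T)) / T for w in range(W)] for h in range(H)]
    # Average over height -> plane of shape T x W.
    tw = [[sum(volume[t][h][w] for h in range(H)) / H for w in range(W)] for t in range(T)]
    # Average over width -> plane of shape T x H.
    th = [[sum(volume[t][h][w] for w in range(W)) / W for h in range(H)] for t in range(T)]
    return hw, tw, th


def shape(plane):
    """Return (rows, cols) of a 2D nested list."""
    return (len(plane), len(plane[0]))


S = 4  # assumed common size for T, H, and W
video = make_volume(S, S, S)
audio = make_volume(S, S, S, offset=0.5)  # audio assumed already video-like

# Stack the six 2D planes (three per modality) into one 3D latent of shape 6 x S x S.
stacked = [*project_planes(video), *project_planes(audio)]

volume_elems = 2 * S ** 3                                   # values in the two full volumes
plane_elems = sum(len(p) * len(p[0]) for p in stacked)      # values in the stacked planes
```

Under these assumptions the two volumes hold 2·S³ values while the stacked planes hold 6·S² values, so the projected representation becomes proportionally smaller as S grows. This is only meant to make the source of the reported efficiency gains concrete, not to reproduce the paper's numbers.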
