ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation

Jiahui Sun, Weining Wang, Mingzhen Sun, Yirong Yang, Xinxin Zhu, Jing Liu

TL;DR

ProAV-DiT is a unified framework for efficient, synchronized audio-video generation. It converts audio into a video-like representation and encodes both modalities into a shared latent space with a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA). A spatio-temporal diffusion Transformer (ST-DiT) then operates in this latent space, stacking the modality latents into a compact sequence and applying spatiotemporal attention to model audiovisual semantics jointly. MDSA uses orthogonal decomposition along the temporal, height, and width axes, together with multi-scale temporal self-attention and group cross-modal attention, to improve cross-modal alignment and temporal coherence; a Bi-Block CrossAttn module adds localized fusion. Results on Landscape, AIST++ and AudioSet show state-of-the-art generation quality and substantial efficiency gains, including faster sampling and lower memory usage, along with strong AV-alignment metrics and favorable human judgments. These findings position ProAV-DiT as a scalable, generalizable approach to high-fidelity, synchronized Sounding Video Generation (SVG) in open-domain settings.

Abstract

Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of multi-scale temporal self-attention and group cross-modal attention. Furthermore, we stack the 2D latents from MDSA into a unified 3D latent space, which is processed by a spatio-temporal diffusion Transformer. This design efficiently models spatiotemporal dependencies, enabling the generation of high-fidelity synchronized audio-video content while reducing computational overhead. Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.
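As a rough illustration of the projection step described in the abstract, the sketch below collapses a 3D latent into three 2D planes, one per axis, and stacks the planes of both modalities into a single array. Mean pooling stands in for the projection (the abstract does not specify how the decomposition is computed), and the function names, the channel count `C`, and the assumption that the temporal, height, and width sizes are all equal to `S` are hypothetical choices made only so the planes can be stacked directly.

```python
import numpy as np

def project_to_planes(z):
    """Collapse a (C, T, H, W) latent into three 2D planes, one per axis.

    Mean pooling is an assumption made for this sketch; the source only
    states that the projection uses orthogonal decomposition.
    """
    hw = z.mean(axis=1)   # average over T -> (C, H, W)
    ht = z.mean(axis=3)   # average over W -> (C, T, H)
    wt = z.mean(axis=2)   # average over H -> (C, T, W)
    return hw, ht, wt

def stack_planes(video_planes, audio_planes):
    """Stack all planes along a new axis; every plane must share one shape."""
    planes = list(video_planes) + list(audio_planes)
    return np.stack(planes, axis=1)   # (C, 6, S, S) when T == H == W == S

rng = np.random.default_rng(0)
C, S = 4, 8                           # hypothetical channel count and side length
z_video = rng.standard_normal((C, S, S, S))
z_audio = rng.standard_normal((C, S, S, S))

latent = stack_planes(project_to_planes(z_video), project_to_planes(z_audio))
print(latent.shape)                   # (4, 6, 8, 8)
```

Under these assumptions the stacked latent holds 6·S² values per channel, compared with 2·S³ for the two full volumes, which is one way to see where a reduction in computational overhead could come from.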

Paper Structure

This paper contains 40 sections, 14 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Video-like audio representation construction. The audio is segmented by video frame (each colored square marks one audio segment), and each segment is converted into a Mel spectrogram, giving the sequence shown below. Each spectrogram has the same duration as a video frame, and the sequence is stacked along the time dimension to form a video-like audio representation ($A \in \mathbb{R}^{T \times H \times W}$), where each spectrogram acts as an image-like frame.
  • Figure 2: (a) Given audio and video inputs, the audio is first converted into a video-like representation ($\mathcal{A}$). Both modalities are encoded via video-to-3D-latent encoders and projected into 2D latents through orthogonal decomposition. These latents are enhanced and fused using a multi-scale attention mechanism: temporal consistency (HT, WT) is modeled by MT-SelfAttn, spatial features (HW) are refined by SelfAttn, and GCM-Attn enables bidirectional cross-modal interaction. The resulting 2D latents are further processed by Bi-Block CrossAttn and decoded by a dual-modal decoder to produce synchronized audio-video outputs. (b) Audio and video latents are concatenated along the temporal axis to form a unified 3D latent representation, which serves as input to the ST-DiT. During iterative diffusion, ST-DiT progressively denoises the latents at each timestep. After the final step, the purified latents are decoded to synthesize video with temporally aligned audio-video streams.
  • Figure 3: Results of our method on Landscape, including spectrogram visualization images and video frames.
  • Figure 4: Qualitative comparison of ProAV-DiT with MM-Diffusion and MM-LDM.
  • Figure 5: Video reconstruction results of our MDSA on the Landscape dataset.
  • ...and 4 more figures
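
The construction described in the Figure 1 caption can be sketched as follows: the waveform is cut into one segment per video frame, each segment becomes a small log-Mel spectrogram, and the spectrograms are stacked along time into an array of shape (T, H, W). This is only a minimal NumPy sketch. The sample rate, frame rate, FFT size, hop length, number of Mel bands, and the helper names `mel_filterbank` and `video_like_audio` are all assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (HTK formula)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)        # (n_fft // 2 + 1,)
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def video_like_audio(wave, sr, fps, n_mels=16, n_fft=256, hop=64):
    """Return an array of shape (T, H, W): one log-Mel spectrogram per video frame."""
    spf = sr // fps                                   # audio samples per video frame
    n_frames = len(wave) // spf                       # T
    fb = mel_filterbank(n_mels, n_fft, sr)
    window = np.hanning(n_fft)
    frames = []
    for t in range(n_frames):
        seg = wave[t * spf:(t + 1) * spf]             # audio aligned with frame t
        starts = range(0, spf - n_fft + 1, hop)
        power = np.stack(
            [np.abs(np.fft.rfft(window * seg[s:s + n_fft])) ** 2 for s in starts],
            axis=1,
        )                                             # (n_fft // 2 + 1, W)
        frames.append(np.log(fb @ power + 1e-6))      # (H, W) = (n_mels, W)
    return np.stack(frames, axis=0)                   # stack along time -> (T, H, W)

sr, fps = 16000, 10                                   # hypothetical rates
time = np.arange(sr) / sr                             # one second of audio
wave = np.sin(2 * np.pi * 440.0 * time)               # 440 Hz test tone
A = video_like_audio(wave, sr, fps)
print(A.shape)                                        # (10, 16, 22)
```

Because each spectrogram covers exactly the samples of one video frame, index `t` in `A` lines up with frame `t` of the video, which is the temporal alignment the caption describes.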