
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian

TL;DR

This work introduces AV-DiT, a multimodal diffusion transformer that repurposes a frozen image-trained DiT backbone with lightweight adapters to jointly generate synchronized audio and video. By encoding video and audio into latent representations and applying modality-specific adapters, temporal attention, and multimodal fusion within a shared transformer, AV-DiT achieves competitive or superior quality with far fewer trainable parameters and faster inference than prior methods. Extensive experiments on AIST++ and Landscape demonstrate effective video realism and audio fidelity without modality-specific re-training of the backbone. The approach highlights that a single pre-trained image generator can underpin efficient joint audio-video generation, with code and models slated for release.

Abstract

Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, it remains under-explored whether a transformer-based diffusion model can efficiently denoise Gaussian noise into high-quality multimodal content. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with both visual and audio tracks. To minimize model complexity and computational cost, AV-DiT uses a shared DiT backbone pre-trained on image-only data, with only lightweight, newly inserted adapters being trainable. This shared backbone serves both audio and video generation. Specifically, the video branch incorporates a trainable temporal attention layer into a frozen pre-trained DiT block for temporal consistency. Additionally, a small number of trainable parameters adapt the image-based DiT block for audio generation. An extra shared DiT block, equipped with lightweight parameters, facilitates feature interaction between the audio and visual modalities, ensuring alignment. Extensive experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters. Furthermore, our results highlight that a single shared image generative backbone with modality-specific adaptations is sufficient for constructing a joint audio-video generator. Our source code and pre-trained models will be released.
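
To give a rough sense of why training only inserted adapters keeps the tunable parameter count small, the sketch below counts the weights in one frozen transformer block and in a bottleneck adapter added to it. This is an illustrative calculation, not the paper's accounting: the abstract does not specify the adapter design, so the bottleneck form (down-projection followed by up-projection), the hidden width, the bottleneck width, and the number of adapters per block are all assumed values, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BlockConfig:
    hidden: int      # token width d of the transformer block (assumed value)
    mlp_ratio: int   # MLP expansion factor (assumed value)
    bottleneck: int  # adapter bottleneck width r (assumed value)


def frozen_block_params(cfg: BlockConfig) -> int:
    """Weights in one standard transformer block; biases and norms are ignored."""
    d = cfg.hidden
    attention = 4 * d * d              # Q, K, V and output projections
    mlp = 2 * d * (cfg.mlp_ratio * d)  # up-projection and down-projection
    return attention + mlp


def adapter_params(cfg: BlockConfig) -> int:
    """Weights in one bottleneck adapter: d -> r, then r -> d (assumed design)."""
    return 2 * cfg.hidden * cfg.bottleneck


def trainable_fraction(cfg: BlockConfig, adapters_per_block: int) -> float:
    """Share of a block's weights that receive gradients when the backbone is frozen."""
    frozen = frozen_block_params(cfg)
    trainable = adapters_per_block * adapter_params(cfg)
    return trainable / (frozen + trainable)


# Hypothetical configuration chosen only for illustration.
cfg = BlockConfig(hidden=1152, mlp_ratio=4, bottleneck=64)
print("frozen weights per block:   ", frozen_block_params(cfg))
print("weights per adapter:        ", adapter_params(cfg))
print(f"trainable share (2 adapters): {trainable_fraction(cfg, adapters_per_block=2):.2%}")
```

Under these assumed sizes the adapters account for only a small percentage of each block's weights. The trainable temporal attention layer and the audio-visual interaction block described in the abstract add further trainable parameters that this sketch does not count.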

Paper Structure (13 sections, 2 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 2: Illustration of our proposed AV-DiT for joint audio and video generation. Our AV-DiT leverages a shared frozen DiT backbone pre-trained on image-only data to simultaneously generate high-quality and realistic audio and video, where only inserted modality-specific adapters are trainable while the original pre-trained weights are frozen.
  • Figure 3: Qualitative examples from our AV-DiT and the MM-Diffusion model. Compared with MM-Diffusion, our method generates higher-quality and more realistic videos. Our generated audio spectrograms also contain fewer artifacts and recover structures that better reflect the visual scenes. For example, our generated audio for Landscape scenes carries more detail, capturing the sound of waves lapping on the shore.
  • Figure 4: Generation results on Landscape. Our AV-DiT yields higher-quality and more realistic-sounding videos than Seeing and Hearing (Xing et al., 2024).
  • Figure 5: Influence of various adapter layers
  • Figure 6: Different adapter ratios
  • ...and 4 more figures