UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation

Lei Zhao, Linfeng Feng, Dongxu Ge, Rujin Chen, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

TL;DR

UniForm unifies audio and video synthesis within a single diffusion-transformer framework. It builds a shared latent space by concatenating audio and video latent codes, and it uses task tokens to support video-to-audio (V2A), audio-to-video (A2V), and text-to-audio-video (T2AV) generation with a single set of parameters. The approach combines text conditioning based on a large language model, pre-trained VAEs for latent encoding and decoding, and a diffusion backbone with cross-modal attention. It achieves competitive results on all three tasks and improves audio-visual alignment. Ablation studies show that text prompts and joint generation improve performance and alignment without task-specific fine-tuning.

Abstract

With the rise of diffusion models, audio-video generation has been revolutionized. However, most existing methods rely on separate modules for each modality, with limited exploration of unified generative architectures. In addition, many are confined to a single task and small-scale datasets. To overcome these limitations, we introduce UniForm, a unified multi-task diffusion transformer that generates both audio and visual modalities in a shared latent space. By using a unified denoising network, UniForm captures the inherent correlations between sound and vision. Additionally, we propose task-specific noise schemes and task tokens, enabling the model to support multiple tasks with a single set of parameters, including video-to-audio, audio-to-video and text-to-audio-video generation. Furthermore, by leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches. Experiments show that UniForm achieves performance close to the state-of-the-art single-task models across three generation tasks, with generated content that is not only highly aligned with real-world data distributions but also enables more diverse and fine-grained generation.

Paper Structure

This paper contains 26 sections, 13 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Illustration of multimodal-conditioned audio-video generation. Text can create audio-video directly; audio or video can serve as a condition to guide the generation of the other.
  • Figure 2: Overview of the proposed UniForm. Vision tokens and audio tokens are integrated and processed within a unified latent space using a DiT model to learn their representations. During training, one of three tasks is randomly selected in each iteration, with task tokens guiding the learning of the DiT. The text encoder, the encoder-decoder for video and audio, and the audio vocoder are all pre-trained models that remain frozen throughout.
  • Figure 3: Comparison with FoleyCrafter in V2A generation on the VGGSound dataset. Our method generates more accurate prosody and richer high-frequency details.
  • Figure 4: Generated samples in the A2V task on the Landscape dataset.
  • Figure 5: Generated samples in the T2AV task on the Landscape dataset.
  • ...and 2 more figures
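
The training setup described in the Figure 2 caption (video and audio tokens joined into one sequence, one of three tasks sampled per iteration, a task token guiding the model) can be illustrated with a small sketch. Everything below is hypothetical: the function name `build_training_input`, the token layout, and in particular the noise rule (keep the conditioning modality clean, noise the target modality) are assumptions made for illustration, and plain additive Gaussian noise stands in for the actual diffusion forward process. The source does not spell out its task-specific noise schemes, so this should not be read as the authors' method.

```python
import random

# The three tasks named in the text.
TASKS = ("V2A", "A2V", "T2AV")


def build_training_input(video_tokens, audio_tokens, rng):
    """Assemble one toy training example for a unified audio-video denoiser.

    Assumed noise rule (not taken from the paper): the conditioning
    modality stays clean and only the target modality is noised.
    """
    task = rng.choice(TASKS)  # one task is sampled per iteration
    noise_video = task in ("A2V", "T2AV")
    noise_audio = task in ("V2A", "T2AV")

    def maybe_noise(tokens, noisy):
        if not noisy:
            return [list(t) for t in tokens]
        # Additive Gaussian noise as a stand-in for the diffusion forward process.
        return [[x + rng.gauss(0.0, 1.0) for x in t] for t in tokens]

    # Shared sequence: video tokens followed by audio tokens.
    sequence = maybe_noise(video_tokens, noise_video) + maybe_noise(
        audio_tokens, noise_audio
    )
    # Positions the denoiser would be trained to reconstruct.
    target_mask = [noise_video] * len(video_tokens) + [noise_audio] * len(
        audio_tokens
    )
    return {
        "task": task,
        "task_token": TASKS.index(task),  # integer id a model could embed
        "sequence": sequence,
        "target_mask": target_mask,
    }
```

Calling `build_training_input(video, audio, random.Random(0))` returns the sampled task, its integer id, the joined token sequence, and a mask of the positions treated as generation targets. Under the assumed rule, T2AV noises both modalities, since text is the only condition in that task.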