Diffusion Models for Joint Audio-Video Generation

Alejandro Paredes La Torre

Abstract

Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I make four contributions toward this goal. First, I release two high-quality, paired audio-video datasets. The datasets consist of 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on these datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity audio-video generations.
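
The control flow of the sequential two-step pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the names `Video`, `Audio`, `generate_audio_video`, `toy_text_to_video`, and `toy_video_to_audio` are hypothetical, the two toy functions are placeholders for real generative models, and the frame rate and sample rate are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Video:
    frames: List[str]  # placeholder frame descriptors, not real images
    fps: int


@dataclass
class Audio:
    samples: List[float]
    sample_rate: int


def generate_audio_video(
    prompt: str,
    text_to_video: Callable[[str], Video],
    video_to_audio: Callable[[Video, str], Audio],
) -> Tuple[Video, Audio]:
    """Two-step generation: video first, then audio conditioned on video and prompt."""
    # Step 1: generate the video from the text prompt alone.
    video = text_to_video(prompt)
    # Step 2: generate audio conditioned on BOTH the generated video and the prompt.
    audio = video_to_audio(video, prompt)
    return video, audio


# Hypothetical stand-ins for the two generative models (assumptions for illustration).
def toy_text_to_video(prompt: str) -> Video:
    # 16 placeholder frames at 8 fps, i.e. a 2-second clip.
    return Video(frames=[f"{prompt} [frame {i}]" for i in range(16)], fps=8)


def toy_video_to_audio(video: Video, prompt: str) -> Audio:
    # Emit silence whose duration matches the video, standing in for
    # temporally synchronized audio.
    sample_rate = 16000
    duration_s = len(video.frames) / video.fps
    return Audio(samples=[0.0] * int(duration_s * sample_rate), sample_rate=sample_rate)


if __name__ == "__main__":
    video, audio = generate_audio_video(
        "a crowded street market", toy_text_to_video, toy_video_to_audio
    )
    print(len(video.frames) / video.fps, len(audio.samples) / audio.sample_rate)
```

Passing the two models in as callables reflects the modular design: either stage could be swapped for a different pretrained model without changing the pipeline itself.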

Paper Structure

This paper contains 18 sections, 17 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: MM-Diffusion unconditional generation. The model was trained from scratch on the concert dataset for 20k steps and captures the semantics of the dataset (lights and human figures). Further training may yield better results.
  • Figure 2: Two-step sequential generation using the prompt "A lively street dance battle under neon lights, with dancers showing off impressive moves to an energetic hip-hop beat. The crowd cheers ...". A clear alignment can be observed between the dancing motion and the rhythmic patterns in the spectrogram.
  • Figure 3: Two-step sequential generation using the prompt "A crowded street market in a vibrant city, filled with stalls of colorful fruits and handmade goods. Vendors shout out their prices, while...". A clear alignment can be observed between the expected noise and the video.
  • Figure 4: Side-by-side comparison of datasets: Concerts and Gaming.
  • Figure 5: MM-Diffusion training loss on my custom concert dataset. Shown are the coupled joint U-Net loss, the audio loss, and the video loss. Training is resource-intensive, the loss is very noisy, and careful hyperparameter tuning is required.
  • ...and 1 more figure