CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

Ruihan Yang, Hannes Gamper, Sebastian Braun

TL;DR

The proposed model outperforms the baseline in both quality and generation speed by introducing a novel cross-modal easy fusion architectural block. Incorporating a contrastive loss further improves audio-visual alignment.

Abstract

We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of our proposed model. Generation quality and alignment performance are assessed from various angles, using both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in terms of quality and generation speed through the introduction of our novel cross-modal easy fusion architectural block. Furthermore, incorporating the contrastive loss improves audio-visual alignment, particularly in the high-correlation video-to-audio generation task.

Paper Structure (37 sections, 7 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Overview of our proposed architecture and method. The detailed implementation of each U-Net block is depicted in the upper right corner, and the intuition behind our design choice of easy fusion is given in Appendix \ref{sec:architecture}. Training of the diffusion model is performed in latent-spectrogram space.
  • Figure 2: Conditioning video (top) with ground truth spectrogram below. The two bottom spectrograms show the generated audio with CMMD and nCMMD conditioned on the video. Sound events are highlighted with a green circle for matches and a red circle for mismatches.
  • Figure 3: Generated video with CMMD conditioned on the audio spectrogram.
  • Figure 4: Per-sample (boxes) and per-set ($\times$) Fréchet audio distance (FAD) results for AIST++ (left) and EPIC-Sound (right). FAD is calculated for 50 output samples of each model using CLAP embeddings with the respective test set as reference. Boxes show the per-sample FAD distribution of these 50 samples, with red markers indicating outliers beyond the whiskers, which extend to 1.5 times the interquartile range. Note that the per-set FAD scores for ground truth (gt) are larger than zero, as only the small subset of the test set used in the evaluation is compared to the whole test set used as reference. Comparing FAD scores for identical set sizes avoids sample size bias \cite{gui2023fad}.
  • Figure 5: Subjective results from user study for EPIC-Sound video conditioned audio generation (left), AIST++ dance video conditioned audio generation (center), and audio conditioned video generation (right).
  • ...and 2 more figures
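
The Fréchet audio distance reported in Figure 4 is, in its standard form, the Fréchet distance between two Gaussians fitted to sets of embeddings. The sketch below illustrates that standard formula only; it is not the authors' evaluation code. The function name `frechet_distance` is hypothetical, the embeddings are assumed to be precomputed (CLAP extraction is not shown), and random arrays stand in for real embeddings.

```python
import numpy as np


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (n_samples, dim), each with n_samples >= 2.
    Computes ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2}).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))
    # Tr((C_x C_y)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of C_x @ C_y, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


# Placeholder data: 200 "reference" embeddings of dimension 8 (assumption).
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 8))

# Identical sets: the distance is zero up to floating-point error.
same = frechet_distance(reference, reference)

# Shifting every dimension by 3 leaves the covariance unchanged, so the
# distance reduces to the squared mean gap, 8 * 3**2 = 72.
shifted = frechet_distance(reference, reference + 3.0)
```

Because the covariance estimates depend on how many embeddings are available, scores computed from sets of different sizes are not directly comparable, which is consistent with the caption's note about comparing FAD only for identical set sizes.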