MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

Mingzhen Sun, Weining Wang, Yanyuan Qiao, Jiahui Sun, Zihan Qin, Longteng Guo, Xinxin Zhu, Jing Liu

TL;DR

MM-LDM, a novel multi-modal latent diffusion model for sounding video generation (SVG), constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space that bridges the information gap between modalities.

Abstract

Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information. To address these issues, we introduce a novel multi-modal latent diffusion model (MM-LDM) for the SVG task. We first unify the representation of audio and video data by converting them into one or several images. Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The former space is perceptually equivalent to the raw signal space of each modality but drastically reduces signal dimensions. The latter space bridges the information gap between modalities and provides more insightful cross-modal guidance. Our proposed method achieves new state-of-the-art results with significant quality and efficiency gains. Specifically, it improves on all evaluation metrics and trains and samples faster on the Landscape and AIST++ datasets. Moreover, for a comprehensive evaluation, we explore its performance on open-domain sounding video generation, long sounding video generation, audio continuation, video continuation, and conditional single-modal generation tasks, where MM-LDM demonstrates strong adaptability and generalization ability.

Paper Structure

This paper contains 24 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Sounding videos generated by our MM-LDM on the Landscape dataset. The samples show vivid scenes such as (a) a mountain, (c) a diving man, and (e) a lake, with matching audio such as the sound of (b) burning wood, (d) sea waves, and (f) rain. All audio presented in this paper can be played in Adobe Acrobat by clicking the corresponding wave figures. More playable sounding video samples can be found at https://iva-mzsun.github.io/MM-LDM.
  • Figure 2: Overall illustration of our multi-modal latent diffusion model (MM-LDM) framework. Modules with gray borders comprise our hierarchical multi-modal autoencoder. The module with an orange border is our transformer-based diffusion model, which performs SVG in the latent space. The green rectangle depicts how the inputs are modified for unconditional audio-video generation (i.e., SVG), audio-to-video generation, and video-to-audio generation, respectively.
  • Figure 3: The detailed architecture of our multi-modal autoencoder. (a) Given audio and video inputs, two modality-specific encoders learn their perceptual latents. Two projectors map from the two respective perceptual latent spaces to the shared semantic space. $\mathcal{L}_{cl}$ represents the classification loss and $\mathcal{L}_{co}$ denotes the contrastive loss. (b) We share the decoder parameters and incorporate several kinds of conditional information for signal decoding. For the video modality, we provide an additional frame-index input to extract information about the target video frame.
  • Figure 4: Qualitative comparison of sounding video samples: MM-Diffusion vs. MM-LDM (ours). All presented audio can be played in Adobe Acrobat by clicking the corresponding wave figures.
  • Figure 5: Samples of (a) long sounding video generation, (b) video-to-audio generation, and (c) audio-to-video generation tasks.
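
The caption of Figure 3 mentions a contrastive loss $\mathcal{L}_{co}$ applied in the shared semantic space, but this summary does not give its exact form. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style objective over paired audio and video semantic features. The function names, the temperature value of 0.07, and the use of cosine similarity are all assumptions made for illustration, not details confirmed by the paper.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax_at(logits, idx):
    """Return log-softmax(logits)[idx], using the max trick for numerical stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - log_sum

def contrastive_loss(audio_feats, video_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    audio_feats[i] and video_feats[i] are semantic features of the same
    sounding video; every other pairing in the batch acts as a negative.
    """
    a = [l2_normalize(f) for f in audio_feats]
    v = [l2_normalize(f) for f in video_feats]
    n = len(a)
    # sim[i][j]: scaled cosine similarity between audio i and video j.
    sim = [[sum(p * q for p, q in zip(a[i], v[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Audio-to-video direction: each row should peak on its diagonal entry.
    a2v = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Video-to-audio direction: each column should peak on its diagonal entry.
    v2a = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
               for j in range(n)) / n
    return 0.5 * (a2v + v2a)

if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned_video = [[1.0, 0.0], [0.0, 1.0]]
    swapped_video = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(audio, aligned_video) < contrastive_loss(audio, swapped_video))
```

Under this assumed form, the loss is small when each audio feature is most similar to the video feature from the same sample and large when the pairing is scrambled, which is the behavior one would want from a term meant to align the two modalities in a shared space.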