Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model

Changchang Sun, Gaowen Liu, Charles Fleming, Yan Yan

TL;DR

This work tackles the task of generating music synchronized to dance videos. It proposes PN-Diffusion, a latent diffusion framework that uses both positive conditioning from forward-played dance cues and negative conditioning from reverse-played cues, learned through dual forward diffusion and dual reverse denoising with a bi-directional objective. By compressing Mel-spectrograms into a latent space and conditioning a Stable Diffusion–style U-Net on I3D and ST-GCN features, the method improves beat alignment and perceptual music quality on the AIST++ and TikTok datasets. The results, supported by objective metrics and human judgments, show improvements over state-of-the-art baselines and point to practical potential for rhythmically coherent cross-modal content generation in social media contexts.

Abstract

Conditional diffusion models have gained increasing attention owing to their impressive results in cross-modal synthesis, where strong alignment between the conditioning input and the generated output can be achieved by training a time-conditioned U-Net augmented with a cross-attention mechanism. In this paper, we focus on the problem of generating music synchronized with the rhythmic visual cues of a given dance video. Considering that bi-directional guidance is more beneficial for training a diffusion model, we propose to enhance the quality of generated music and its synchronization with dance videos by adopting both positive and negative rhythmic information as conditions (PN-Diffusion), for which dual diffusion and reverse processes are devised. Specifically, to train a sequential multi-modal U-Net structure, PN-Diffusion combines a noise prediction objective for positive conditioning with an additional noise prediction objective for negative conditioning. To accurately define and select both positive and negative conditioning, we exploit temporal correlations in dance videos, capturing positive and negative rhythmic cues by playing the videos forward and backward, respectively. Through subjective and objective evaluations of input-output correspondence, in terms of dance-music beat alignment and the quality of the generated music, experimental results on the AIST++ and TikTok dance video datasets demonstrate that our model outperforms state-of-the-art (SOTA) dance-to-music generation models.
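The idea of pairing a positive and a negative noise prediction objective can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract only states that there are two noise prediction objectives and that negative cues come from playing the video backward. Everything else here is an assumption made for illustration: the helper names, the linear `toy_denoiser` standing in for the conditioned U-Net, the use of independent noise for each branch, and the `neg_weight` factor that combines the two terms.

```python
import numpy as np


def make_conditions(frame_feats):
    """Split one clip into positive (forward) and negative (reversed) conditioning.

    frame_feats: array of shape (T, D), one feature vector per video frame.
    """
    positive = frame_feats
    negative = frame_feats[::-1].copy()  # same frames, played backward
    return positive, negative


def forward_diffuse(z0, alpha_bar_t, rng):
    """Standard DDPM forward step: z_t = sqrt(a_bar) * z0 + sqrt(1 - a_bar) * eps."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return z_t, eps


def toy_denoiser(z_t, cond, weights):
    """Hypothetical stand-in for the conditioned U-Net (a linear map, not the real model)."""
    w_latent, w_cond = weights
    # Order-sensitive pooling so forward and reversed clips give different vectors.
    ramp = np.linspace(0.0, 1.0, cond.shape[0])[:, None]
    pooled = (ramp * cond).mean(axis=0)
    return z_t @ w_latent + pooled @ w_cond


def bidirectional_loss(z0, frame_feats, alpha_bar_t, weights, neg_weight, rng):
    """Sum of two noise-prediction MSE terms (assumed form, not taken from the paper)."""
    pos, neg = make_conditions(frame_feats)
    z_pos, eps_pos = forward_diffuse(z0, alpha_bar_t, rng)  # positive branch
    z_neg, eps_neg = forward_diffuse(z0, alpha_bar_t, rng)  # negative branch
    loss_pos = np.mean((eps_pos - toy_denoiser(z_pos, pos, weights)) ** 2)
    loss_neg = np.mean((eps_neg - toy_denoiser(z_neg, neg, weights)) ** 2)
    return loss_pos + neg_weight * loss_neg


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((8, 4))  # 8 frames, 4-dim features (toy sizes)
    z0 = rng.standard_normal(6)          # toy music latent
    weights = (0.1 * rng.standard_normal((6, 6)), 0.1 * rng.standard_normal((4, 6)))
    print(bidirectional_loss(z0, feats, 0.5, weights, neg_weight=0.5, rng=rng))
```

The time-weighted pooling in `toy_denoiser` is there only so that reversing the frame order changes the conditioning vector. A plain mean over frames would make the forward and reversed clips indistinguishable.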

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of our proposed PN-Diffusion. The rich temporal synchronization information conveyed by the normal forward-played dance videos and their reverse-played counterparts is extracted by visual and motion encoders. A dual diffusion process and a dual reverse process are introduced to better capture the temporal correlation and rhythmic consistency between dance video and music, and a bi-directional denoising objective is designed to train the diffusion model.
  • Figure 2: Overview of the U-Net structure of the PN-Diffusion model, where we adapt the U-Net of Stable Diffusion and use both positive and negative conditioning as input in the dual reverse process.
  • Figure 3: Sensitivity analysis of the hyper-parameter $\alpha$.
  • Figure 4: Waveform comparison of generated and ground-truth (GT) music.
  • Figure 5: Mel-spectrogram visualizations of GT dance music and dance music generated by PN-Diffusion and LORIS, where areas of high intensity and energy are highlighted with green boxes. Compared with LORIS, our generated music more closely matches the ground truth in terms of timing.