Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation

Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan

TL;DR

This work addresses the challenge of ensuring strong conditioning-output alignment in cross-modal diffusion generation. It introduces Conditional Discrete Contrastive Diffusion (CDCD), a mutual-information-based objective, and integrates it via two diffusion mechanisms (step-wise parallel diffusion and sample-wise auxiliary diffusion), along with intra- and inter-negative sampling. The CDCD loss explicitly maximizes the mutual information $I(z_0;c)$ and connects to the conventional variational objective, enabling faster convergence and improved input-output fidelity across dance-to-music, text-to-image, and class-conditioned image synthesis. Empirically, CDCD achieves state-of-the-art or competitive results while reducing the required number of diffusion steps by approximately 35–40%, significantly speeding up inference for cross-modal generation.

Abstract

Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route -- we explicitly enhance input-output connections by maximizing their mutual information. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, combining the diffusion training and contrastive learning for the first time by connecting it with the conventional variational objectives. We demonstrate the efficacy of our approach in evaluations with diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, as well as class-conditioned image synthesis. On each, we enhance the input-output correspondence and achieve higher or competitive general synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks, significantly increasing the inference speed.
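
To make the idea of "maximizing mutual information with a contrastive loss" concrete, the following is a minimal sketch of the generic InfoNCE objective, which is one common way to maximize a lower bound on mutual information using negative samples. It is not the paper's CDCD loss, whose exact form is not given in this summary. The function names and the similarity scores are hypothetical and chosen only for illustration.

```python
import math


def info_nce_loss(pos_score, neg_scores):
    """Generic InfoNCE loss for one positive score and several negative scores.

    Scores are assumed to be unnormalized similarities between a condition
    and a candidate output (hypothetical values, not taken from the paper).
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max before exponentiating for stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    # loss = -log( exp(pos) / sum over all candidates of exp(score) )
    return log_denominator - pos_score


def mi_lower_bound(loss, num_candidates):
    """Standard InfoNCE bound: I >= log(N) - loss, N = 1 positive + negatives."""
    return math.log(num_candidates) - loss


# Hypothetical scores: a well-aligned pair versus an uninformative scorer.
aligned = info_nce_loss(5.0, [0.0, 0.0, 0.0])
uninformative = info_nce_loss(0.0, [0.0, 0.0, 0.0])

print(f"aligned loss: {aligned:.4f}, bound: {mi_lower_bound(aligned, 4):.4f}")
print(f"uninformative loss: {uninformative:.4f}, "
      f"bound: {mi_lower_bound(uninformative, 4):.4f}")
```

When the positive pair scores much higher than the negatives, the loss approaches zero and the bound approaches log(N); when all candidates score equally, the loss equals log(N) and the bound is zero. This is the sense in which lowering a contrastive loss tightens the correspondence between condition and output.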

Paper Structure (28 sections, 10 equations, 7 figures, 10 tables, 1 algorithm)

This paper contains 28 sections, 10 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Examples of the input (left column) and synthesized output (middle column) from our contrastive diffusion model for dance-to-music (Rows 1-2), text-to-image (Rows 3-4), and class-conditioned (Row 5) generation experiments on five datasets. The right column shows some synthesized data with reasonable quality but weaker correspondence to the input from existing methods (yezhu2022quantizedgan; gu2021vector).
  • Figure 2: Overview of the proposed pipeline. Our framework includes two major components: a VQ-based encoder-decoder model (top) and a conditioned discrete contrastive diffusion as generative model on the VQ space (bottom). In the contrastive diffusion stage, we illustrate our proposed step-wise parallel diffusion (bottom left) and sample-wise auxiliary diffusion (bottom right). The variables in green denote those from the principal diffusion process, while the variables in red represent the diffusion invoked by negative samples. Here we show audio generation from video input, but demonstrate that this approach extends to different modalities, e.g., text-to-image.
  • Figure 3: Illustration of intra- and inter-negative sampling for music and image data.
  • Figure 4: Convergence analysis in terms of diffusion steps for the dance-to-music task on the AIST++ dataset (left) and the text-to-image task on the CUB200 dataset (right). We observe that our contrastive diffusion models converge at around 80 and 60 steps, respectively, requiring about 35% and 40% fewer steps than the vanilla models, which converge at 120 and 100 steps, while maintaining superior performance. We use the same number of steps for training and inference.
  • Figure 5: More qualitative results from our text-to-image experiments on the CUB200 dataset. We show examples of the text input (left column), the synthesized images from our contrastive diffusion model with 80 diffusion steps and an FID score of 12.61 (middle column), and the output from the existing method (gu2021vector) with 100 diffusion steps and an FID score of 12.97.
  • ...and 2 more figures