Music Consistency Models

Zhengcong Fei, Mingyuan Fan, Junshi Huang

TL;DR

MusicCM tackles the high sampling cost of diffusion-based music generation by transferring the efficiency of consistency models to the domain of mel-spectrogram synthesis. It combines latent-space consistency distillation with an adversarial discriminator and introduces a shared restricted diffusion approach to produce coherent long-form music from multiple diffusion processes. Empirical results show competitive generation quality while reducing inference steps to 4–6, achieving roughly 1 second per minute of music on a single A100 GPU. This yields a practical baseline for fast, near real-time text-to-music synthesis and highlights the potential of consistency-model techniques in audio domains.

Abstract

Consistency models have shown remarkable capability for efficient image and video generation, enabling synthesis with minimal sampling steps, and have proven advantageous in mitigating the computational burden of diffusion models. Nevertheless, the application of consistency models to music generation remains largely unexplored. To address this gap, we present Music Consistency Models (MusicCM), which leverage the concept of consistency models to efficiently synthesize mel-spectrograms for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, MusicCM incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended, coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results demonstrate the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notably, MusicCM achieves seamless music synthesis with a mere four sampling steps, requiring only about one second per minute of generated music, showcasing the potential for real-time application.

Paper Structure (27 sections, 9 equations, 3 figures, 4 tables, 1 algorithm)

Figures (3)

  • Figure 1: Overview of music consistency models. Given a source music mel-spectrogram $x_0$, a forward diffusion operation first adds noise to the music. The noised $x_{n+k}$ is then fed to both the student and teacher models to predict music clips. $\hat{x}_n$ is estimated by the teacher model and fed into the EMA student model. To learn self-consistency, a distillation loss constrains the outputs of the student and EMA student models to agree, and an adversarial loss is used to fool a discriminator that is trained to distinguish the generated samples $x_0^{pred}$ from real music $x_0$. The whole consistency distillation is conducted in the latent space, and conditional guidance is omitted for ease of presentation. The teacher model is a music diffusion model; the student shares the teacher's network structure and is initialized with the teacher's parameters.
  • Figure 2: Comparison of long music generation through independent paths vs. shared restricted paths. Input text prompt: "Bright, cheerful and groovy piano, classical." As expected, there is no coherence between clips generated along independent paths. Starting from the same noise, our shared restriction process steers these initial diffusion paths into consistent, high-quality music clips.
  • Figure 3: Qualitative visualization results under different numbers of inference steps. More steps generally yield better quality and temporal continuity of music. Importantly, MusicCM can produce plausible results with few steps, or even a single step.
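
The distillation step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation. It assumes a variance-exploding noise schedule ($x_t = x_0 + \sigma_t \epsilon$), a single Euler step as the teacher's ODE solver, and a small linear map standing in for the real network. The function names (`consistency_fn`, `teacher_ode_step`, `distillation_loss`, `ema_update`), the noise levels, and the EMA decay are all hypothetical. The adversarial loss, text conditioning, and the gradient update of the student are omitted.

```python
import numpy as np

def consistency_fn(weights, x, sigma):
    """Toy stand-in for a network: skip connection plus a linear map.

    The coefficients are chosen so that the output equals x when sigma == 0,
    which is the boundary condition a consistency function must satisfy.
    """
    c_skip = 1.0 / (1.0 + sigma ** 2)
    c_out = sigma / np.sqrt(1.0 + sigma ** 2)
    return c_skip * x + c_out * (x @ weights)

def teacher_ode_step(teacher_w, x, sigma_from, sigma_to):
    """One Euler step of the probability-flow ODE using the teacher (assumed solver)."""
    denoised = consistency_fn(teacher_w, x, sigma_from)
    slope = (x - denoised) / sigma_from  # dx/dsigma estimate; requires sigma_from > 0
    return x + (sigma_to - sigma_from) * slope

def distillation_loss(student_w, ema_w, teacher_w, x0, sigma_hi, sigma_lo, rng):
    """Squared-error consistency loss between the student and EMA student outputs."""
    # Forward diffusion: noise the clean latent x_0 up to the higher noise level (x_{n+k}).
    x_hi = x0 + sigma_hi * rng.standard_normal(x0.shape)
    # Teacher estimates the less noisy point on the same trajectory (x_hat_n).
    x_lo_hat = teacher_ode_step(teacher_w, x_hi, sigma_hi, sigma_lo)
    # Student sees x_{n+k}; the EMA student sees the teacher's estimate x_hat_n.
    pred = consistency_fn(student_w, x_hi, sigma_hi)
    target = consistency_fn(ema_w, x_lo_hat, sigma_lo)
    return float(np.mean((pred - target) ** 2))

def ema_update(ema_w, student_w, decay=0.95):
    """Exponential moving average of the student weights (decay is an assumption)."""
    return decay * ema_w + (1.0 - decay) * student_w

rng = np.random.default_rng(0)
latent_dim = 8
teacher_w = 0.1 * rng.standard_normal((latent_dim, latent_dim))
student_w = teacher_w.copy()  # student is initialized from the teacher
ema_w = student_w.copy()      # EMA student starts as a copy of the student
x0 = rng.standard_normal((4, latent_dim))  # batch of toy "latents"

loss = distillation_loss(student_w, ema_w, teacher_w, x0,
                         sigma_hi=2.0, sigma_lo=1.0, rng=rng)
ema_w = ema_update(ema_w, student_w)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop, the loss would be differentiated with respect to the student weights only, with the EMA student treated as a fixed target, and the adversarial term from the discriminator would be added to it.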