MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies

Ke Chen; Yusong Wu; Haohe Liu; Marianna Nezhurina; Taylor Berg-Kirkpatrick; Shlomo Dubnov

TL;DR

This work tackles text-to-music generation under data scarcity and plagiarism concerns by introducing MusicLDM, a diffusion-based system built on Stable Diffusion and AudioLDM with music-specific CLAP and HiFi-GAN retraining. It further advances data augmentation through beat-synchronous mixup strategies, BAM and BLM, guided by Beat Transformer to interpolate within the music manifold. Empirical results on Audiostock demonstrate that latent-space mixing (BLM) offers the best trade-off among generation quality, text-audio relevance, and novelty, while beat-aware augmentation reduces copying. The approach provides a practical path toward more diverse and faithful text-to-music synthesis with publicly accessible code and models.

Abstract

Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to the limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts the Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the HiFi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embedding space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on the CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.
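As a rough illustration of the mixup idea described in the abstract, the sketch below crops two clips so that each starts on a downbeat and then forms a convex combination of them. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the plain-list representation of audio, the Beta(alpha, alpha) sampling of the mixing weight, and the precomputed downbeat indices (which a beat tracker would supply) are all hypothetical. It also assumes the two clips have already been grouped by similar tempo.

```python
import random


def mixup(x, y, lam):
    """Return the convex combination lam * x + (1 - lam) * y."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(x, y)]


def align_to_downbeat(samples, downbeat_index, length):
    """Crop a clip so it starts at the given downbeat and has a fixed length."""
    clip = samples[downbeat_index:downbeat_index + length]
    if len(clip) < length:
        raise ValueError("clip is too short after alignment")
    return clip


def beat_synchronous_mixup(x, x_downbeat, y, y_downbeat, length,
                           alpha=5.0, rng=random):
    """Align two clips on their downbeats, then mix them.

    Assumptions (not taken from the text above): the mixing weight is
    drawn from Beta(alpha, alpha), and downbeat indices are given.
    The inputs may be waveform samples (audio mixup) or, applied the
    same way, latent vectors (latent mixup).
    """
    lam = rng.betavariate(alpha, alpha)
    x_aligned = align_to_downbeat(x, x_downbeat, length)
    y_aligned = align_to_downbeat(y, y_downbeat, length)
    return mixup(x_aligned, y_aligned, lam), lam
```

Because the mixing weight stays in [0, 1], every mixed value lies between the two source values at that position, which is the sense in which the augmented examples remain inside the convex hull of the training data.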

Paper Structure (35 sections, 3 equations, 5 figures, 3 tables)

This paper contains 35 sections, 3 equations, 5 figures, 3 tables.

Introduction
Related Work
Text-to-Audio Generation
Plagiarism on Diffusion Models
Mixup on Data Augmentation
Methodology
Beat-Synchronous Mixup
Beat-tracking via Beat Transformer
Beat-Synchronous Audio Mixup
Beat-Synchronous Latent Mixup
What are BAM and BLM doing?
Experiments
Experimental Setup
Dataset
Hyperparameters and Training Details
...and 20 more sections

Figures (5)

Figure 1: The architecture of MusicLDM, which contains a contrastive language-audio pretraining (CLAP) model, an audio latent diffusion model with VAE, and a HiFi-GAN neural vocoder.
Figure 2: Mixup strategies. Left: tempo grouping and downbeat alignment via Beat Transformer. Middle: BAM and BLM mixup strategies. Right: How BAM and BLM are applied in the feature space of audio signals and audio latent variables.
Figure 3: Violin plots of the audio-audio similarity and the text-to-audio similarity.
Figure 4: The spectrograms of music pairs indicated by high cosine similarity score of CLAP audio embeddings.
Figure 5: The spectrograms of music pairs indicated by low cosine similarity score of CLAP audio embeddings.
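
The similarity scores referred to in Figures 3 to 5 are cosine similarities between embeddings. The sketch below shows how such a score could be computed, along with one plausible way to flag copying: taking the highest similarity between a generated clip and any training clip. This is an illustrative sketch only; the function names and the nearest-neighbor formulation are assumptions, and the embeddings are plain lists standing in for CLAP audio embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def nearest_training_similarity(generated, training_embeddings):
    """Highest cosine similarity between one generated embedding and
    any training embedding (hypothetical novelty check: a high value
    suggests the output closely resembles a training example)."""
    if not training_embeddings:
        raise ValueError("training_embeddings must not be empty")
    return max(cosine_similarity(generated, t) for t in training_embeddings)
```

Under this reading, pairs like those in Figure 4 would score close to 1, while pairs like those in Figure 5 would score much lower.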
