Melody-Guided Music Generation

Shaopeng Wei, Manzhen Wei, Haoyu Wang, Yu Zhao, Gang Kou

TL;DR

MG2 tackles the challenge of generating harmonically coherent music from text descriptions by introducing melody-guided generation. It has three parts: Contrastive Language-Music Pretraining (CLMP), which implicitly aligns text, waveform, and melody; a retrieval-augmented diffusion module, which explicitly conditions on a retrieved melody; and a decoding stage that uses a VAE and a vocoder. On MusicCaps and MusicBench, MG2 outperforms open-source baselines while using far fewer parameters and far less training data, demonstrating data efficiency and better musical harmony. Human evaluations with three groups (general users, musicians, and short-video creators) report strong recognizability, text relevance, satisfaction, quality, and market potential, underscoring MG2’s practical applicability in content creation and media production.

Abstract

We present the Melody-Guided Music Generation (MG2) model, a novel approach that uses melody to guide text-to-music generation and, despite its simple method and limited resources, achieves excellent performance. Specifically, we first align the text with audio waveforms and their associated melodies using the newly proposed Contrastive Language-Music Pretraining, so that the learned text representation is fused with implicit melody information. Subsequently, we condition the retrieval-augmented diffusion module on both the text prompt and the retrieved melody. This allows MG2 to generate music that reflects the content of the given text description while preserving its intrinsic harmony under the guidance of explicit melody information. We conducted extensive experiments on two public datasets: MusicCaps and MusicBench. Surprisingly, the experimental results demonstrate that MG2 surpasses current open-source text-to-music generation models while using fewer than 1/3 of the parameters or less than 1/200 of the training data of state-of-the-art counterparts. Furthermore, we conducted comprehensive human evaluations involving three types of users and five perspectives, using newly designed questionnaires to explore the potential real-world applications of MG2.
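
The contrastive alignment of text, waveform, and melody described above can be illustrated with a small sketch. The abstract does not give the exact objective, so everything below is an assumption for illustration: a symmetric InfoNCE loss summed over the three modality pairs, a temperature of 0.07, and the hypothetical function names `symmetric_info_nce` and `tri_modal_loss`. It is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    # Row i of `a` and row i of `b` are treated as a positive pair;
    # all other rows in the batch act as negatives.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def tri_modal_loss(text, wave, melody, temperature=0.07):
    # Assumed combination: sum the pairwise losses over all three modality pairs.
    return (symmetric_info_nce(text, wave, temperature)
            + symmetric_info_nce(text, melody, temperature)
            + symmetric_info_nce(wave, melody, temperature))
```

In this sketch, a batch whose text, waveform, and melody embeddings line up row by row yields a loss near zero, while mismatched rows are penalized, which is the sense in which the three modalities end up in one shared vector space.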

Paper Structure (38 sections, 12 equations, 8 figures, 8 tables)

This paper contains 38 sections, 12 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The architecture of the proposed $\text{MG}^2$ comprises three main components: (a) CLMP: This module leverages melody information implicitly by aligning the waveform, melody, and text descriptions within a unified vector space; (b) Retrieval-augmented Diffusion Module: This module generates a music latent vector based on the given text description. It first constructs a melody vector database from the previously trained melody representations, then retrieves a melody as guidance and explicitly combines it with the input query to condition the latent diffusion model; (c) Decoding Module: Finally, the decoding module, which incorporates a Variational Autoencoder (VAE) and a Vocoder, synthesizes the playable music.
  • Figure 2: Illustration of accurate alignment.
  • Figure 3: Ablation Study.
  • Figure 4: Illustration of melody guidance.
  • Figure 5: Illustration of accurate semantic information understanding.
  • ...and 3 more figures
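
The retrieval step in the Figure 1 caption (build a melody vector database, retrieve a melody for the text query, combine the two into a condition) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the class `MelodyIndex` and function `build_condition` are hypothetical names, retrieval uses brute-force cosine similarity, and the "combine" step is assumed to be plain concatenation, since the caption does not say how the query and melody are merged.

```python
import numpy as np

def normalize(x, eps=1e-8):
    # Unit-normalize along the last axis (works for vectors and matrices).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

class MelodyIndex:
    """Hypothetical in-memory melody vector database."""

    def __init__(self, melody_embeddings):
        # Store unit-length melody embeddings, one per row.
        self.vectors = normalize(np.asarray(melody_embeddings, dtype=np.float64))

    def retrieve(self, query, k=1):
        # Cosine similarity between the query and every stored melody.
        scores = self.vectors @ normalize(np.asarray(query, dtype=np.float64))
        top = np.argsort(-scores)[:k]
        return top, scores[top]

def build_condition(text_embedding, index):
    # Assumed combination: concatenate the text embedding with the
    # best-matching melody embedding to form the diffusion condition.
    text_embedding = np.asarray(text_embedding, dtype=np.float64)
    top, _ = index.retrieve(text_embedding, k=1)
    melody = index.vectors[top[0]]
    return np.concatenate([text_embedding, melody]), int(top[0])
```

Retrieving with the text embedding only makes sense because CLMP places text and melody in the same vector space. A real system would likely replace the brute-force search with an approximate nearest-neighbor index.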