Intelligent Text-Conditioned Music Generation

Zhouyao Xie, Nikhil Yadala, Xinyi Chen, Jing Xi Liu

TL;DR

This work targets text-conditioned music generation by adapting a CLIP-like alignment framework to bridge natural language and symbolic music. The authors assemble a multimodal dataset by combining Lakh MIDI, Lakh Pianoroll, MuMu, and the Million Song Dataset (MSD), generating REMI-formatted representations paired with Amazon-style text labels, and train a text–music alignment model via contrastive learning. A MusicVAE-based generator, guided by a cross-modal encoder and CLIP-style feedback, generates music from text prompts and is evaluated with objective metrics and human judgments. While demonstrating feasibility, the study highlights data scarcity, training challenges, and the need for future improvements (diffusion decoding, better loss functions, multi-instrument support) to achieve practical, high-quality text-conditioned music generation.

Abstract

CLIP (Contrastive Language-Image Pre-Training) is a multimodal neural network trained on (text, image) pairs to predict the most relevant text caption for a given image. It has been used extensively in image generation by connecting its output to a generative model such as VQGAN; the most notable example is OpenAI's DALL-E 2. In this project, we apply a similar approach to bridge the gap between natural language and music. Our approach has two steps. First, we train a CLIP-like model on (text, music) pairs with a contrastive loss, so that a piece of music is aligned with its most probable text caption. Second, we combine the alignment model with a music decoder to generate music. To the best of our knowledge, this is the first attempt at text-conditioned deep music generation. Our experiments show that it is possible to train the text-music alignment model with a contrastive loss and to train a decoder that generates music from text prompts.
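
To make the alignment step concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of matched (text, music) embedding pairs. It is an illustration under stated assumptions, not the implementation used in this project: the function names are hypothetical, the embeddings are plain lists of floats standing in for encoder outputs, and the temperature of 0.07 is borrowed from the initial value used by CLIP rather than taken from this work.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(text_emb, music_emb, temperature=0.07):
    """Symmetric cross-entropy loss; pair i in text_emb matches pair i in music_emb.

    The temperature value is an assumption for illustration only.
    """
    text = [l2_normalize(v) for v in text_emb]
    music = [l2_normalize(v) for v in music_emb]
    n = len(text)

    # logits[i][j]: scaled cosine similarity between text i and music j.
    logits = [
        [sum(a * b for a, b in zip(t, m)) / temperature for m in music]
        for t in text
    ]

    # Text-to-music direction: row i should assign most probability to column i.
    text_to_music = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Music-to-text direction: column j should assign most probability to row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    music_to_text = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (text_to_music + music_to_text) / 2


if __name__ == "__main__":
    captions = [[1.0, 0.0], [0.0, 1.0]]   # toy text embeddings
    matched = [[1.0, 0.0], [0.0, 1.0]]    # music embeddings aligned with captions
    swapped = [[0.0, 1.0], [1.0, 0.0]]    # music embeddings paired with the wrong caption
    print("matched loss:", clip_contrastive_loss(captions, matched))
    print("swapped loss:", clip_contrastive_loss(captions, swapped))
```

In this toy setting the matched batch gives a loss close to zero and the swapped batch a large one, which is the signal that pulls each piece of music toward its own caption and away from the other captions in the batch.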

Paper Structure (44 sections, 11 figures, 7 tables)

This paper contains 44 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: System Design Diagram
  • Figure 2: Data Design Diagram
  • Figure 3: Feature Engineering Diagram
  • Figure 4: Conditional MultiModal Architecture
  • Figure 5: Contrastive loss based multimodal CLIP
  • ...and 6 more figures