Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces

Lilac Atassi

TL;DR

The experimental results show that integrating a text-to-music model with a large language model to generate music with form can produce 2.5-minute-long music that is highly structured, strongly organized, and cohesive.

Abstract

Recent music generation methods based on transformers have a context window of up to a minute. The music generated by these methods is largely unstructured beyond the context window. Even with a longer context window, learning long-scale structures from musical data is a prohibitively challenging problem. This paper proposes integrating a text-to-music model with a large language model to generate music with form. The paper discusses solutions to the challenges of such integration. The experimental results show that the proposed method can generate 2.5-minute-long music that is highly structured, strongly organized, and cohesive.

Paper Structure (6 sections, 4 figures)

This paper contains 6 sections, 4 figures.

Figures (4)

  • Figure 1: The mean (top row) and the variance (bottom row) of the fused self-similarity (SS) matrices, estimated from 100 samples drawn from Pond5, generated by our method, and generated by MusicGen. The SS matrices are downsampled to $5\times5$. The results indicate that, compared to MusicGen, our method produces samples that more closely resemble the Pond5 samples in terms of long-term temporal consistency and the diversity of recurring sections.
  • Figure 2: Illustration of the incoherence in images generated by Dall-E 3, Midjourney, and Meta AI. These inconsistencies are evident in images featuring mirrors and waving flags. Notice the forked or merged stripes on the flags and the inconsistent reflection and incidence angles in the mirrors, among other inconsistencies.
  • Figure 3: Visualization of the self-similarity matrices for three MusicGen samples, one sample from our method, and one from Pond5. With MusicGen, at a low temperature (T) of 0.1, the music is repetitive. At T=5.0, the output is mostly random noise. At T=1, the music is meandering. The sample from our method resembles the one from Pond5, which was composed and arranged by a musician.
  • Figure 4: Left: Subjective comparison by non-musicians of the generated music and music sampled from Pond5, measured as the mean opinion score (MOS) for how engaging the music is. Whiskers: 95% CI. Right: Subjective comparison of the samples by musicians, critiquing the musical structures.
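
The self-similarity analysis described in the captions of Figures 1 and 3 can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each audio sample has already been converted to a (frames, dims) feature array, uses cosine similarity, and downsamples by block averaging. The "fused" step mentioned in Figure 1 is not reproduced here, and the function names `self_similarity`, `downsample`, and `summarize` are hypothetical.

```python
import numpy as np


def self_similarity(features: np.ndarray) -> np.ndarray:
    """Cosine self-similarity of a (frames, dims) feature sequence.

    Assumption: cosine similarity between per-frame feature vectors.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero vectors
    return unit @ unit.T


def downsample(ss: np.ndarray, size: int = 5) -> np.ndarray:
    """Average-pool a square matrix into a size x size grid of blocks.

    Assumption: block averaging; requires ss.shape[0] >= size.
    """
    n = ss.shape[0]
    edges = np.linspace(0, n, size + 1).astype(int)
    out = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            block = ss[edges[i]:edges[i + 1], edges[j]:edges[j + 1]]
            out[i, j] = block.mean()
    return out


def summarize(samples, size: int = 5):
    """Element-wise mean and variance of downsampled SS matrices."""
    stack = np.stack([downsample(self_similarity(f), size) for f in samples])
    return stack.mean(axis=0), stack.var(axis=0)


if __name__ == "__main__":
    # Random stand-in features; real use would substitute audio features.
    rng = np.random.default_rng(0)
    samples = [rng.normal(size=(200, 16)) for _ in range(100)]
    mean, var = summarize(samples)
    print(mean.shape, var.shape)
```

In a sketch like this, a high mean in off-diagonal cells would suggest sections that recur across the piece, while the variance across samples gives a rough sense of how much that recurrence pattern differs from one sample to the next.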