SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition

Shuangrui Ding, Zihan Liu, Xiaoyi Dong, Pan Zhang, Rui Qian, Junhao Huang, Conghui He, Dahua Lin, Jiaqi Wang

TL;DR

SongComposer presents a unified large language model for simultaneous lyric and melody generation in symbolic form. It introduces a word-level lyric-melody tuple representation, a scalar pitch initialization strategy, and a three-stage, structure-aware training pipeline to encode motif- and phrase-level song organization. The authors assemble SongCompose, a large bilingual dataset with precise lyric-melody alignments, and demonstrate that SongComposer outperforms GPT-4 and other baselines on lyric-to-melody, melody-to-lyric, song continuation, and text-to-song tasks, supported by extensive ablations. Limitations are acknowledged regarding audio synthesis and multi-track accompaniment, with future work proposed to bridge symbolic and acoustic generation for end-to-end text-to-song production.

Abstract

Creating lyrics and melodies for the vocal track in a symbolic format, known as song composition, demands expert musical knowledge of melody, an advanced understanding of lyrics, and precise alignment between them. Despite achievements in sub-tasks such as lyric generation, lyric-to-melody generation, and melody-to-lyric generation, a unified model for song composition has not yet been achieved. In this paper, we introduce SongComposer, a pioneering step towards a unified song composition model that can readily create symbolic lyrics and melodies following instructions. SongComposer is a music-specialized large language model (LLM) that, for the first time, integrates the capability of simultaneously composing lyrics and melodies into LLMs by leveraging three key innovations: 1) a flexible tuple format for word-level alignment of lyrics and melodies, 2) an extended tokenizer vocabulary for song notes, with scalar initialization based on musical knowledge to capture rhythm, and 3) a multi-stage pipeline that captures musical structure, starting with motif-level melody patterns and progressing to phrase-level structure for improved coherence. Extensive experiments demonstrate that SongComposer outperforms advanced LLMs, including GPT-4, in tasks such as lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation. Moreover, we will release SongCompose, a large-scale dataset for training, containing paired lyrics and melodies in Chinese and English.
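
To make the first innovation concrete, the following is a minimal Python sketch of what a word-level lyric-melody tuple format could look like. It is an illustration only, not the paper's actual serialization: the `Unit` class, the `|` delimiter, and the `<p…>`, `<d…>`, `<r…>` note tokens are all hypothetical names chosen for this example, and the pitch, duration, and rest values are assumed to be already discretized integers. The sketch shows why such a format is flexible: the same structure covers paired data (word plus notes), lyric-only data (word alone), and melody-only data (notes alone), and one word may carry several notes.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical note tokens: <p60> = pitch, <d4> = duration bin, <r0> = rest bin.
NOTE_TOKEN = re.compile(r"<([pdr])(\d+)>")


@dataclass(frozen=True)
class Unit:
    """One word-level unit: an optional lyric word aligned with zero or more notes."""
    word: Optional[str] = None
    notes: Tuple[Tuple[int, int], ...] = ()  # (pitch, duration_bin) pairs
    rest: Optional[int] = None               # rest bin after the unit, if any


def serialize(units: List[Unit]) -> str:
    """Flatten units into one string, one '|'-delimited tuple per word."""
    chunks = []
    for unit in units:
        fields = []
        if unit.word is not None:
            fields.append(unit.word)
        for pitch, duration in unit.notes:
            fields.append(f"<p{pitch}>")
            fields.append(f"<d{duration}>")
        if unit.rest is not None:
            fields.append(f"<r{unit.rest}>")
        chunks.append(" ".join(fields))
    return "|" + "|".join(chunks) + "|"


def parse(text: str) -> List[Unit]:
    """Inverse of serialize: recover the word-level units from the string."""
    units = []
    for chunk in text.strip("|").split("|"):
        if not chunk:
            continue  # skip empty segments (e.g. an empty song)
        word, pitches, durations, rest = None, [], [], None
        for field in chunk.split():
            match = NOTE_TOKEN.fullmatch(field)
            if match is None:
                word = field  # anything that is not a note token is the lyric word
            elif match.group(1) == "p":
                pitches.append(int(match.group(2)))
            elif match.group(1) == "d":
                durations.append(int(match.group(2)))
            else:
                rest = int(match.group(2))
        units.append(Unit(word, tuple(zip(pitches, durations)), rest))
    return units


paired = [
    Unit("Twinkle", ((60, 4),), 0),
    Unit("star", ((67, 4), (69, 2)), 1),  # one word sung over two notes
]
paired_text = serialize(paired)
lyric_only_text = serialize([Unit(word="hello")])
melody_only_text = serialize([Unit(notes=((60, 4),), rest=0)])
```

Because each word and its notes sit inside the same delimited tuple, the alignment is explicit in the token stream itself and survives a round trip through `parse`, which is the property a word-level format is meant to provide.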

Paper Structure

This paper contains 29 sections, 7 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Overview of the song-related instruction-following composition by SongComposer. SongComposer utilizes symbolic song representation to compose melodies tailored to lyrics, craft lyrics to complement melodies, extend existing songs, and generate new songs from textual prompts.
  • Figure 2: (a) Symbolic song representation involves precise alignment of notes and lyrics; (b) The structure of a song often comprises motif-level and phrase-level concepts.
  • Figure 3: Visualization of attention distribution for different key/query types.
  • Figure 4: Memorization analysis of SongComposer.
  • Figure 5: Pipeline of paired lyric-melody data collection.
  • ...and 10 more figures