
Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, Zhimeng Zhang

TL;DR

This work introduces the task of text-to-song synthesis and presents Melodist, a two-stage framework that first synthesizes a singing voice from a music score and then generates an accompaniment guided by natural language prompts. It combines acoustic-token-based generation (via SoundStream and a unit-based vocoder) with a multi-scale transformer backbone and tri-tower contrastive pretraining that aligns text with both vocal and accompaniment patterns. A crawled Mandarin dataset and open-source SVS data support training under data scarcity. Experiments show that Melodist achieves competitive quality and strong prompt grounding, outperforming several baselines on SVS, accompaniment generation, and text-to-song tasks. The approach demonstrates the feasibility and controllability of end-to-end text-conditioned song synthesis, offering a new direction for structured, cross-modal music generation with potential applications in content creation and human-in-the-loop music production.

Abstract

A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently, and little attention has been paid to song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representations for controllable V2A synthesis. A Chinese song dataset mined from a music website is built to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.

Paper Structure (50 sections, 6 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: The comparison of three tasks: singing voice synthesis, accompaniment generation, and text-to-song. In this work, we investigate the relationship between vocals and accompaniment for text-to-song synthesis.
  • Figure 2: The overview of Melodist, the proposed two-stage text-to-song synthesis model. Subfigure (a) presents the two-stage pipeline. Subfigure (b) presents the multi-scale Transformer architecture, in which e and $z_t^k$ denote the <EOS> token and the k-th audio token at the t-th frame, respectively.
  • Figure 3: The architecture of the tri-tower contrastive framework. $Z_P$, $Z_V$, $Z_A$ refer to the representations extracted by the text encoder, the vocal encoder, and the accompaniment encoder, respectively. We use different shapes to represent different triplets, while color is used to distinguish the kinds of inputs. Embeddings of the same triplet are pulled closer, while those of different triplets are pushed apart in the joint embedding space.
  • Figure 4: Screenshot of MOS testing.
  • Figure 5: Screenshot of SMOS testing.
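
The Figure 3 caption describes pulling embeddings of the same (text, vocal, accompaniment) triplet together and pushing other embeddings apart. The following is a minimal sketch of one way such an objective could be written, not the paper's confirmed loss. It assumes a symmetric InfoNCE term for each pair of towers, averaged over the three pairs, with a hypothetical temperature of 0.1. The function names (`normalize`, `info_nce`, `tri_tower_loss`) are illustrative, and plain Python lists stand in for encoder outputs; a real implementation would use a tensor library.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def tri_tower_loss(z_p, z_v, z_a, temperature=0.1):
    """Assumed objective: symmetric InfoNCE averaged over the three tower pairs.

    z_p, z_v, z_a are batches of text, vocal, and accompaniment embeddings;
    row i of each batch belongs to the same triplet.
    """
    towers = [[normalize(v) for v in batch] for batch in (z_p, z_v, z_a)]
    pair_losses = []
    for x, y in combinations(towers, 2):
        forward = info_nce(x, y, temperature)
        backward = info_nce(y, x, temperature)
        pair_losses.append(0.5 * (forward + backward))
    return sum(pair_losses) / len(pair_losses)


# Toy batch of two triplets with 2-D embeddings.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]  # accompaniment rows paired with the wrong triplet

loss_aligned = tri_tower_loss(aligned, aligned, aligned)
loss_mismatched = tri_tower_loss(aligned, aligned, swapped)
```

In this toy batch, `loss_aligned` is close to zero because every triplet's three embeddings coincide, while `loss_mismatched` is much larger because the accompaniment embeddings match the wrong text and vocal rows.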