Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment
Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, Zhimeng Zhang
TL;DR
This work introduces text-to-song synthesis and presents Melodist, a two-stage framework that first generates a singing voice from a music score and then generates accompaniment guided by natural language prompts. It combines acoustic-token-based generation (via SoundStream and a unit-based vocoder) with a multi-scale transformer backbone and tri-tower contrastive pretraining that aligns text with both vocal and accompaniment patterns. A Mandarin song dataset crawled from a music website, together with open-source SVS data, supports training under data scarcity. Extensive experiments show that Melodist achieves competitive quality and strong prompt grounding, outperforming several baselines on SVS, accompaniment, and text-to-song tasks. The approach demonstrates the feasibility and controllability of text-conditioned song synthesis, offering a new direction for structured, cross-modal music generation with potential applications in content creation and human-in-the-loop music production.
Abstract
A song is a combination of singing voice and accompaniment. However, existing works address singing voice synthesis and music generation independently, and little attention has been paid to song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representations for controllable V2A synthesis. A Chinese song dataset mined from a music website is built to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.
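
The abstract does not spell out the tri-tower pretraining objective. As a rough illustration only, the sketch below assumes three encoders (text, vocal, accompaniment) whose pooled embeddings are aligned with a pairwise symmetric InfoNCE loss. The function names, the temperature value, and the choice to average all three pairs are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE between two batches of paired embeddings.

    Row i of `a` and row i of `b` form a positive pair; every other row
    in the batch serves as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # shape: (batch, batch)

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a retrieval directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def tri_tower_loss(text, vocal, accomp, temperature=0.1):
    """Hypothetical tri-tower objective: mean of the three pairwise losses."""
    return (
        info_nce(text, vocal, temperature)
        + info_nce(text, accomp, temperature)
        + info_nce(vocal, accomp, temperature)
    ) / 3.0
```

Under these assumptions, a batch whose text, vocal, and accompaniment embeddings are correctly paired yields a lower loss than the same batch with the pairings shuffled, which is the signal that would push a text encoder toward representations useful for conditioning V2A synthesis.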
