
Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

Ruiqi Li, Zhiqing Hong, Yongqi Wang, Lichao Zhang, Rongjie Huang, Siqi Zheng, Zhou Zhao

TL;DR

MelodyLM tackles text-to-song generation without requiring user-provided music scores, so the melody can be controlled entirely through text. It uses a three-stage pipeline: text-to-MIDI (MIDI-LM), text-to-vocal (Vocal-LM), and vocal-to-accompaniment via a latent diffusion model with hybrid conditioning. MIDI serves as the intermediate melody feature, and prompts steer both melody and accompaniment. The authors report superior objective and subjective results on Mandarin pop data, and the model remains controllable even with minimal input (lyrics and a vocal reference). The framework narrows the gap between natural language descriptions and high-quality, synchronized singing plus accompaniment. Ablations illustrate the importance of the MIDI representation, prompt conditioning, and staged generation for quality and controllability.

Abstract

Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, just input textual prompts or even directly input MIDI. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://melodylm666.github.io.
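
The staged data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`SongRequest`, `generate_midi`, `generate_vocal`, `generate_accompaniment`, `synthesize_song`) is hypothetical, and each stage is a toy placeholder rather than a language model or diffusion model. The only point it shows is the control flow: MIDI is generated from text unless the user supplies it, the vocal stage is conditioned on the MIDI and a reference voice, and the accompaniment stage is conditioned on the vocal track.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SongRequest:
    """Hypothetical container for the inputs named in the abstract."""
    lyrics: str
    vocal_reference: str               # identifier of a reference voice clip
    text_prompt: Optional[str] = None  # optional natural-language description
    midi: Optional[List[int]] = None   # optional user-provided MIDI note numbers


def generate_midi(lyrics: str, text_prompt: Optional[str]) -> List[int]:
    """Stage 1 placeholder standing in for a text-to-MIDI model.

    Toy rule, not a model: one note (middle C, 60) per whitespace-separated token.
    """
    return [60 for _ in lyrics.split()]


def generate_vocal(lyrics: str, midi: List[int], vocal_reference: str) -> dict:
    """Stage 2 placeholder: a vocal track conditioned on MIDI and a reference voice."""
    return {"kind": "vocal", "notes": list(midi), "voice": vocal_reference}


def generate_accompaniment(vocal: dict, text_prompt: Optional[str]) -> dict:
    """Stage 3 placeholder: accompaniment conditioned on the vocal track."""
    return {
        "kind": "accompaniment",
        "aligned_to": len(vocal["notes"]),  # stands in for temporal alignment
        "prompt": text_prompt,
    }


def synthesize_song(req: SongRequest) -> dict:
    """Run the three stages in order, skipping stage 1 when MIDI is supplied."""
    if req.midi is not None:
        midi = req.midi
    else:
        midi = generate_midi(req.lyrics, req.text_prompt)
    vocal = generate_vocal(req.lyrics, midi, req.vocal_reference)
    accompaniment = generate_accompaniment(vocal, req.text_prompt)
    return {"midi": midi, "vocal": vocal, "accompaniment": accompaniment}
```

With only lyrics and a reference voice, `synthesize_song` falls back to the stage 1 placeholder; passing `midi=[62, 64]` bypasses it, mirroring the "minimal requirements" and "full control" modes the abstract describes.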

Paper Structure (44 sections, 3 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Overview of MelodyLM.
  • Figure 2: The overall architecture. The gray dashed lines indicate optional inputs. Modules marked with a lock are frozen during training. Lyrics (semantic) and Lyrics (acoustic) are the same input lyrics; the former provides potential semantic information (and is therefore processed by a text encoder), while the latter provides only pronunciation-related acoustic information.
  • Figure 3: Multi-scale language modeling for MIDI tokens and vocal acoustic tokens.
  • Figure 4: Accompaniment latent diffusion with hybrid conditioning.
  • Figure 5: Visualization of the pitch and prosody modeling.