Adapting Speech Language Model to Singing Voice Synthesis

Yiwen Zhao, Jiatong Shi, Jinchuan Tian, Yuxun Tang, Jiarui Hai, Jionghao Han, Shinji Watanabe

TL;DR

This work investigates the generalization of Speech Language Models to Singing Voice Synthesis by adapting a 1.7B TTS-pretrained SLM to SVS using a 135-hour synthetic corpus. The approach tokenizes music scores and singing waveforms into a multi-stream LM input, then applies a conditional flow-matching refinement to generate mel-spectrograms for a compatible vocoder, addressing data scarcity and token noise. Experiments show the SLM-based SVS pipeline achieves competitive performance with leading discrete SVS models, with mel-space flow outperforming direct codec resynthesis and pitch conditioning providing additional gains. The results underscore the strong generalizability of SLMs for low-resource, multi-modal tasks and suggest promising directions for future multi-task SLM research in audio domains.

Abstract

Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B-parameter TTS-pretrained SLM for singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. Building upon ESPNet-SpeechLM, our recipe involves the following procedure: (1) tokenization of music score conditions and singing waveforms, (2) multi-stream language model token prediction, (3) conditional flow matching-based mel-spectrogram generation, and (4) a mel-to-wave vocoder. Experimental results demonstrate that our adapted SLM generalizes well to SVS and achieves performance comparable to leading discrete token-based SVS models.

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Illustration of music note tokenization and waveform tokenization. The phoneme, duration, and MIDI are quantized to discrete tokens at 50 FPS and appended to the TTS vocabulary. The audio tokens are obtained from a pretrained codec encoder and an SSL model, with each frame represented by a concatenation of one SSL token and eight codec tokens.
  • Figure 2: SVS fine-tuning task template. The audio codec tokens and SSL tokens are concatenated along the RVQ stream axis.
  • Figure 3: Training and inference process of flow matching.
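
The frame layout described in the captions for Figures 1 and 2 can be illustrated with a small sketch. This is not the authors' implementation: the function names `score_to_frames` and `stack_streams`, the note format `(phoneme, midi, duration_sec)`, and the cumulative rounding rule are all assumptions made for illustration. Only the 50 FPS frame rate and the layout of one SSL token plus eight codec tokens per frame come from the captions.

```python
FPS = 50            # frame rate stated in the Figure 1 caption
NUM_CODEC = 8       # codec tokens per frame, per the Figure 1 caption


def score_to_frames(notes, fps=FPS):
    """Expand note-level (phoneme, midi, duration_sec) tuples to one entry per frame.

    Hypothetical helper: the note format and rounding rule are assumptions.
    """
    frames = []
    elapsed = 0.0
    for phoneme, midi, duration in notes:
        elapsed += duration
        # Round the cumulative end time so per-note rounding errors do not add up.
        end_frame = round(elapsed * fps)
        frames.extend([(phoneme, midi)] * (end_frame - len(frames)))
    return frames


def stack_streams(ssl_tokens, codec_tokens):
    """Build one multi-stream frame per time step: [ssl, codec_1, ..., codec_8].

    Hypothetical helper: ssl_tokens is a list of ints, codec_tokens a list of
    8-element lists, both aligned at the same frame rate.
    """
    if len(ssl_tokens) != len(codec_tokens):
        raise ValueError("SSL and codec streams must have the same number of frames")
    stacked = []
    for ssl, codec in zip(ssl_tokens, codec_tokens):
        if len(codec) != NUM_CODEC:
            raise ValueError(f"expected {NUM_CODEC} codec tokens per frame")
        stacked.append([ssl, *codec])
    return stacked


if __name__ == "__main__":
    # Toy score: three notes with made-up phonemes, MIDI pitches, and durations.
    notes = [("n", 60, 0.10), ("i", 60, 0.30), ("h", 62, 0.06)]
    frames = score_to_frames(notes)
    print(len(frames), frames[0], frames[-1])

    # Toy audio tokens for two frames.
    audio = stack_streams([7, 9], [list(range(8)), list(range(8, 16))])
    print(audio)
```

Each stacked frame has nine entries, so the language model predicts nine parallel token streams per time step under this layout. How the score tokens are actually mapped into the TTS vocabulary is not specified in this summary and is left out of the sketch.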