TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching

Wenxiang Guo, Yu Zhang, Changhao Pan, Rongjie Huang, Li Tang, Ruiqi Li, Zhiqing Hong, Yongqi Wang, Zhou Zhao

TL;DR

TechSinger tackles the lack of fine-grained technique control in multilingual singing voice synthesis by introducing a flow-matching framework with a Flow Matching Pitch Predictor, a CFG-based mel-Postnet, a technique detector, and a prompt-based technique predictor. A data-augmentation strategy via automatic technique annotation and GPT-4o-powered prompts enables robust, flexible control over seven vocal techniques across five languages. Empirical results show TechSinger achieves higher audio quality and more expressive, technique-aligned synthesis than baselines, with ablations validating each component. This work advances controllable SVS by integrating flow-based generation, automatic technique labeling, and natural-language prompts, enabling expressive, user-friendly, multilingual singing synthesis in practical settings.

Abstract

Singing voice synthesis has made remarkable progress in generating natural and high-quality voices. However, existing methods rarely provide precise control over vocal techniques such as intensity, mixed voice, falsetto, bubble, and breathy tones, thus limiting the expressive potential of synthetic voices. We introduce TechSinger, an advanced system for controllable singing voice synthesis that supports five languages and seven vocal techniques. TechSinger leverages a flow-matching-based generative model to produce singing voices with enhanced expressive control over various techniques. To enhance the diversity of training data, we develop a technique detection model that automatically annotates datasets with phoneme-level technique labels. Additionally, our prompt-based technique prediction model enables users to specify desired vocal attributes through natural language, offering fine-grained control over the synthesized singing. Experimental results demonstrate that TechSinger significantly enhances the expressiveness and realism of synthetic singing voices, outperforming existing methods in terms of audio quality and technique-specific control. Audio samples can be found at https://gwx314.github.io/tech-singer/.

Paper Structure

This paper contains 39 sections, 13 equations, 8 figures, 8 tables, and 2 algorithms.

Figures (8)

  • Figure 1: The overall architecture of TechSinger. In panel (a), the technique predictor predicts technique sequences from natural language prompts. The flow matching pitch predictor (FMPP) is conditioned on the expanded input encoding $E_p$ to generate the F0 sequence. The mel decoder generates the coarse mel-spectrogram, and the vector field estimator infers the vector field $v_m$. In panel (b), an ODE solver uses $v_m$ to transport standard Gaussian noise into a fine mel-spectrogram. In panel (c), the inputs to the technique predictor are the prompt, notes, and lyrics. The text encoder is a pre-trained language model.
  • Figure 2: The architecture of the technique detector.
  • Figure 3: Visualization of the mel-spectrograms and pitch contour of the ground-truth and results of different SVS systems.
  • Figure 4: Visualization of the mel-spectrogram results generated by TechSinger under different techniques. The red box contains the fundamental pitch, and the yellow box contains the details of harmonics.
  • Figure 5: The detailed architecture of the vector field estimator.
  • ...and 3 more figures
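
The Figure 1 caption describes an ODE solver that uses a vector field to carry standard Gaussian noise to a fine mel-spectrogram. The snippet below is a minimal sketch of that idea only, not the authors' implementation. It assumes a fixed-step Euler solver and a straight-line probability path, and it replaces the learned vector field estimator with a hypothetical closed-form `toy_vector_field` that already knows the target. The function names, the array shape, and the step count are all illustrative assumptions.

```python
import numpy as np


def toy_vector_field(x, t, target):
    """Hypothetical stand-in for a learned vector field estimator.

    Assumes the straight-line path x_t = (1 - t) * x_0 + t * x_1, whose
    velocity can be written as (x_1 - x_t) / (1 - t). A real system would
    predict this with a neural network conditioned on model features
    instead of reading the target directly.
    """
    return (target - x) / (1.0 - t)


def euler_sample(vector_field, x0, num_steps, **cond):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        x = x + dt * vector_field(x, t, **cond)
    return x


rng = np.random.default_rng(0)
# Illustrative "fine mel-spectrogram": 80 mel bins by 4 frames (assumed shape).
target = rng.normal(size=(80, 4))
# Sampling starts from standard Gaussian noise.
noise = rng.standard_normal(target.shape)

sample = euler_sample(toy_vector_field, noise, num_steps=10, target=target)
print(np.abs(sample - target).max())
```

With this particular toy field, the Euler iterates land on the target at the final step up to floating-point error, which makes the loop easy to check. With a learned estimator, the number of steps and the choice of solver would instead trade speed against sample quality.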