TokSing: Singing Voice Synthesis based on Discrete Tokens

Yuning Wu; Chunlei Zhang; Jiatong Shi; Yuxun Tang; Shan Yang; Qin Jin

TL;DR

The paper tackles singing voice synthesis (SVS) with discrete tokens by introducing TokSing, which blends tokens derived from self-supervised learning (SSL) models, across layers and across models, to capture diverse singing semantics. It augments the token-based representation with a melody signal, $LF0$, and trains with a melody loss and a token prediction loss $\mathcal{L}_{\text{tok}}$ to improve melody expression. Experiments on Opencpop and ACE-Opencpop show TokSing achieving higher objective and subjective quality than Mel spectrogram baselines, while reducing the storage cost of the intermediate representation and accelerating convergence. Overall, the work demonstrates the viability and efficiency of discrete-token SVS with melody-aware enhancements and cross-model token blending.

Abstract

Recent advances in speech synthesis have benefited significantly from discrete tokens extracted from self-supervised learning (SSL) models. As an intermediate representation, discrete tokens offer higher storage efficiency and greater operability than traditional continuous Mel spectrograms. However, in singing voice synthesis (SVS), achieving a high level of melody expression with discrete tokens is a major challenge. In this paper, we introduce TokSing, a discrete-token-based SVS system equipped with a token formulator that offers flexible token blending. We observe melody degradation during discretization, which prompts us to integrate a melody signal with the discrete tokens and to incorporate a specially designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that TokSing outperforms the Mel spectrogram baselines while offering advantages in the space cost of the intermediate representation and in convergence speed.
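To make the melody-aware training idea concrete, here is a minimal NumPy sketch of how a token prediction loss and a melody loss on $LF0$ might be combined. It is an illustration under stated assumptions, not the authors' implementation: the choice of cross-entropy for tokens, an L1 error on voiced frames for melody, the convention that unvoiced frames have F0 equal to 0, and the names `log_f0`, `token_cross_entropy`, `melody_l1`, `total_loss`, and `melody_weight` are all hypothetical.

```python
import numpy as np

def log_f0(f0_hz):
    """Return (LF0, voiced mask). Assumption: unvoiced frames are marked by F0 == 0
    and are assigned LF0 = 0.0 instead of log(0)."""
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0_hz > 0
    lf0 = np.zeros_like(f0_hz)
    lf0[voiced] = np.log(f0_hz[voiced])
    return lf0, voiced

def token_cross_entropy(logits, targets):
    """Mean cross-entropy over frames. logits: (T, V) floats, targets: (T,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def melody_l1(pred_lf0, true_lf0, voiced):
    """Mean absolute LF0 error, computed on voiced frames only (an assumption)."""
    if not voiced.any():
        return 0.0
    return float(np.abs(pred_lf0[voiced] - true_lf0[voiced]).mean())

def total_loss(logits, targets, pred_lf0, true_lf0, voiced, melody_weight=1.0):
    """Weighted sum of the two losses; melody_weight is a hypothetical hyperparameter."""
    return token_cross_entropy(logits, targets) + melody_weight * melody_l1(
        pred_lf0, true_lf0, voiced
    )
```

Masking unvoiced frames keeps the melody term from penalizing predictions where no pitch is defined; whether the paper does the same is not stated in this summary.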

Paper Structure (15 sections, 4 figures, 5 tables)


Introduction
Method
Token Formulation
Musical Encoder
Vocoder
Experiment
Experimental Settings
Comparison Experiments
Ablation of the Reconstruction in Vocoder
Ablation of Musical Encoder
Ablation of Token Formulations
Transfer Learning
Conclusion
Acknowledgements
References

Figures (4)

Figure 1: Discrete-token-based SVS system architecture. The system contains three parts: a token formulator, a musical encoder, and a vocoder. $LF0$, the logarithm of the fundamental frequency, serves as the melody signal. The ablations of the melody prediction modules (purple blocks) and the melody enhancing module (pink block) are discussed in the melody ablation section. $d_i^j$ represents a discrete token.
Figure 2: Token formulations 1, 2, and 3 refer to forming tokens from different layers of the same model, blending tokens from different models, and generating tokens by residual quantization, respectively.
Figure 3: Visualization of audio segments generated by different systems. (e) and (f) are resynthesized by the codec decoder.
Figure 4: Convergence speed comparison of TokSing and the Mel spectrogram-based system.
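
The three token formulations named in the Figure 2 caption can be illustrated with a small NumPy sketch. This is a toy example under assumptions, not the paper's code: nearest-centroid lookup against a given codebook stands in for whatever discretization the authors use, all streams are assumed to share one frame rate, and the helper names `quantize`, `stack_token_streams`, and `residual_quantize` are hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Map each frame in features (T, D) to the index of its nearest
    codebook vector (K, D) under squared Euclidean distance."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def stack_token_streams(streams):
    """Formulations 1 and 2: combine per-frame token streams taken from
    several layers of one model, or from several models, into a (T, N) array.
    Assumes every stream has the same number of frames."""
    return np.stack(streams, axis=1)

def residual_quantize(features, codebooks):
    """Formulation 3: quantize the features, subtract the chosen codebook
    vector, then quantize what is left with the next codebook."""
    residual = np.asarray(features, dtype=np.float64)
    tokens = []
    for codebook in codebooks:
        idx = quantize(residual, codebook)
        tokens.append(idx)
        residual = residual - codebook[idx]
    return np.stack(tokens, axis=1), residual

# Toy usage: three frames of 2-dimensional features and two tiny codebooks.
feats = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
cb1 = np.array([[0.0, 0.0], [4.0, 4.0]])
cb2 = np.array([[0.0, 0.0], [1.0, 1.0]])
rq_tokens, leftover = residual_quantize(feats, [cb1, cb2])
multi_stream = stack_token_streams([quantize(feats, cb1), quantize(feats, cb2)])
```

In every case each frame ends up described by a short tuple of integers rather than a continuous vector, which is the property behind the storage savings the abstract mentions.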
