TokSing: Singing Voice Synthesis based on Discrete Tokens
Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin
TL;DR
The paper tackles singing voice synthesis with discrete tokens by introducing TokSing, which blends SSL-derived token sources across layers and models to capture diverse singing semantics. It augments the token representation with a melody signal, log-F0 ($\mathrm{LF0}$), and trains with a melody loss $\mathcal{L}_{m}$ and a token prediction loss $\mathcal{L}_{\text{tok}}$, which improves melody expression. Experiments on Opencpop and ACE-Opencpop show TokSing achieving higher objective and subjective quality than Mel-spectrogram baselines, while reducing the storage cost of the intermediate representation and converging faster. Overall, the work demonstrates the viability and efficiency of discrete-token SVS with melody-aware enhancements and cross-model token blending.
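As a rough illustration of how a token prediction loss and a melody loss might be combined, the sketch below sums a per-frame cross-entropy over token logits with an L1 error on LF0. The summary does not give the exact form of either term, so the L1 choice, the `melody_weight` parameter, and all function names here are assumptions made for illustration, not the authors' formulation.

```python
import math


def token_cross_entropy(logits, target):
    """Cross-entropy for one frame: `logits` over the token vocabulary, `target` a token id."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def combined_loss(frame_logits, target_tokens, pred_lf0, ref_lf0, melody_weight=1.0):
    """Return (total, token_loss, melody_loss), each averaged over frames.

    Assumption: the melody term is a mean absolute error on LF0 and the two
    terms are added with a hypothetical scalar weight.
    """
    n = len(target_tokens)
    l_tok = sum(token_cross_entropy(l, t) for l, t in zip(frame_logits, target_tokens)) / n
    l_m = sum(abs(p - r) for p, r in zip(pred_lf0, ref_lf0)) / n
    return l_tok + melody_weight * l_m, l_tok, l_m


# Two frames, a 4-token vocabulary, uniform logits, and one LF0 value off by 0.5.
total, l_tok, l_m = combined_loss(
    frame_logits=[[0.0] * 4, [0.0] * 4],
    target_tokens=[1, 3],
    pred_lf0=[5.0, 5.5],
    ref_lf0=[5.0, 5.0],
)
```

Keeping the two terms separate in the return value makes it easy to monitor whether melody error or token error dominates during training.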
Abstract
Recent advances in speech synthesis have benefited significantly from discrete tokens extracted from self-supervised learning (SSL) models. Compared with traditional continuous Mel spectrograms, discrete tokens are more storage-efficient and easier to manipulate as intermediate representations. However, in singing voice synthesis (SVS), achieving a high level of melody expression remains a major challenge for discrete tokens. In this paper, we introduce TokSing, a discrete-token-based SVS system equipped with a token formulator that supports flexible token blending. We observe melody degradation during discretization, which prompts us to integrate a melody signal with the discrete tokens and to incorporate a specially designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that TokSing outperforms the Mel-spectrogram baselines while offering advantages in intermediate-representation storage cost and convergence speed.
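To make the storage claim concrete, the following back-of-the-envelope sketch compares the bytes needed per frame for a Mel spectrogram and for a blended token representation with an attached LF0 value. Every number in it is an illustrative assumption rather than a figure from the paper: 80 Mel bins stored as 32-bit floats, three token streams, a 1024-entry vocabulary per stream, one 32-bit LF0 value per frame, and equal frame rates for both representations.

```python
def bytes_per_frame_mel(n_mels=80, bytes_per_value=4):
    """Bytes for one Mel frame (assumed: 80 bins, float32)."""
    return n_mels * bytes_per_value


def bytes_per_frame_tokens(n_streams=3, vocab_size=1024, lf0_bytes=4):
    """Bytes for one frame of discrete tokens plus one LF0 value.

    Each token is stored in the smallest whole number of bytes that can
    index the vocabulary (assumed: 3 streams, 1024 entries, float32 LF0).
    """
    bits = max(1, (vocab_size - 1).bit_length())
    token_bytes = (bits + 7) // 8  # round up to whole bytes
    return n_streams * token_bytes + lf0_bytes


mel = bytes_per_frame_mel()
tok = bytes_per_frame_tokens()
ratio = mel / tok  # how many times larger the Mel frame is under these assumptions
```

Under these assumed settings the Mel frame is several tens of times larger than the token frame; the actual saving depends on the real codebook sizes, the number of blended streams, and the frame rates of each representation.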
