TokSing: Singing Voice Synthesis based on Discrete Tokens

Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

TL;DR

The paper tackles singing voice synthesis (SVS) with discrete tokens by introducing TokSing, which blends SSL-derived tokens from different layers and different models to capture diverse singing semantics. It augments the token-based representation with a melody signal, $LF0$, and trains with a melody loss alongside a token prediction loss $\mathcal{L}_{\text{tok}}$, which improves melody expression. Experiments on Opencpop and ACE-Opencpop show that TokSing achieves higher objective and subjective quality than Mel spectrogram baselines while reducing the cost of the intermediate representation and converging faster. Overall, the work demonstrates that discrete-token SVS with melody-aware enhancements and cross-model token blending is both viable and efficient.

Abstract

Recent advancements in speech synthesis have benefited significantly from discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability as intermediate representations than traditional continuous Mel spectrograms. However, in singing voice synthesis (SVS), achieving a high level of melody expression poses a great challenge for discrete tokens. In this paper, we introduce TokSing, a discrete-based SVS system equipped with a token formulator that offers flexible token blendings. We observe melody degradation during discretization, which prompts us to integrate a melody signal with the discrete tokens and to incorporate a specially designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that TokSing outperforms the Mel spectrogram baselines while offering advantages in intermediate representation space cost and convergence speed.
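
To make the storage-efficiency claim concrete, here is a rough back-of-the-envelope sketch comparing the per-frame cost of a continuous Mel spectrogram with that of discrete token indices. Every number in it (80 Mel bins, 32-bit floats, a codebook of 1024 entries, 3 token streams, equal frame rates) is an illustrative assumption and is not taken from the paper, and the helper names are hypothetical.

```python
import math

# All sizes below are illustrative assumptions, NOT values reported in the paper.

def mel_bits_per_frame(n_mels, bits_per_value):
    """Bits needed to store one Mel spectrogram frame of continuous values."""
    return n_mels * bits_per_value

def token_bits_per_frame(codebook_size, n_streams):
    """Bits needed to store one frame of token indices, one index per stream."""
    return n_streams * math.ceil(math.log2(codebook_size))

# Assumed setup: 80 Mel bins stored as 32-bit floats, versus
# 3 token streams drawn from codebooks of 1024 entries each.
mel_bits = mel_bits_per_frame(n_mels=80, bits_per_value=32)
tok_bits = token_bits_per_frame(codebook_size=1024, n_streams=3)
ratio = mel_bits / tok_bits

print(f"Mel frame: {mel_bits} bits, token frame: {tok_bits} bits, ratio: {ratio:.1f}x")
```

Under these assumed sizes the token representation is far smaller per frame. The actual saving depends on the codebook size, the number of blended token streams, and the frame rates of the systems being compared.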

Paper Structure (15 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Discrete-based SVS system architecture. The system contains three parts: a token formulator, a musical encoder, and a vocoder. $LF0$, the logarithm of the fundamental frequency, serves as the melody signal. Ablations of the melody prediction modules (purple blocks) and the melody enhancing module (pink block) are discussed in the paper's section on melody (\ref{ssec: melody}). $d_i^j$ denotes a discrete token.
  • Figure 2: Token formulations 1, 2, and 3 refer to forming tokens from different layers of the same model, blending tokens from different models, and generating tokens by residual quantization, respectively.
  • Figure 3: Visualization of audio segments generated by different systems. (e) and (f) are resynthesized by the codec decoder.
  • Figure 4: Convergence speed comparison of TokSing and the Mel spectrogram-based system.
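
The third token formulation in Figure 2 generates tokens by residual quantization. The following is a minimal sketch of generic two-stage residual quantization on toy 2-D vectors, intended only to illustrate the idea. The codebooks, the input vector, and the function names are made up for illustration; the quantizer actually used by the paper is not described in this section.

```python
# Generic residual quantization sketch with hypothetical toy codebooks.
# Nothing here is taken from the paper's implementation.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    tokens = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def reconstruct(tokens, codebooks):
    """Sum the selected entry of every stage to approximate the input."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Assumed toy codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],
    [(0.0, 0.0), (0.25, -0.25), (-0.25, 0.25)],
]

tokens, residual = residual_quantize((1.2, 0.8), CODEBOOKS)
approx = reconstruct(tokens, CODEBOOKS)
print(tokens, approx)
```

Each frame is represented by one token per stage, so later stages refine the approximation left by earlier ones. In this toy case the second stage shrinks the reconstruction error from about 0.2 per dimension to about 0.05.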