Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations

Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Myrsini Christidou, Alexandra Vioni, Georgia Maniati, Junkwang Oh, Gunu Jho, Inchul Hwang, Pirros Tsiakoulis, Aimilios Chalamandaris

TL;DR

The paper tackles singing voice synthesis (SVS) under low-resource constraints by introducing Karaoker-SSL, which uses reduced self-supervised speech representations to condition a non-attentive acoustic model trained only on text and speech data. It adds pitch prediction as a multi-task objective and uses a U-Net discriminator within a Diffusion GAN framework to refine mel spectrograms, requiring neither singing data nor lyrics timestamps during training. A key contribution is selecting task-relevant SSL dimensions from parallel speech-singing data, enabling unsupervised conditioning that preserves style while reducing linguistic content. The approach reports competitive subjective and objective performance for cross-domain SVS, offering a resource-efficient path toward deployment without domain-specific singing datasets.

Abstract

In this paper, we propose a singing voice synthesis model, Karaoker-SSL, that is trained only on text and speech data as a typical multi-speaker acoustic model. It is a low-resource pipeline that does not utilize any singing data end-to-end, since its vocoder is also trained on speech data. Karaoker-SSL is conditioned by self-supervised speech representations in an unsupervised manner. We preprocess these representations by selecting only a subset of their task-correlated dimensions. The conditioning module is indirectly guided to capture style information during training by multi-tasking. This is achieved with a Conformer-based module, which predicts the pitch from the acoustic model's output. Thus, Karaoker-SSL allows singing voice synthesis without reliance on hand-crafted and domain-specific features. There are also no requirements for text alignments or lyrics timestamps. To refine the voice quality, we employ a U-Net discriminator that is conditioned on the target speaker and follows a Diffusion GAN training scheme.
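
The dimension-selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes frame-level embeddings stored as plain (time, dim) lists and ranks dimensions by the absolute gap between the time-averaged speech and singing embeddings of a parallel pair, in the spirit of the comparison shown in Figure 1. The ranking criterion, the function names (`time_average`, `select_dimensions`, `reduce_frames`) and the `keep` parameter are assumptions made for illustration; this summary does not state the selection rule actually used by the authors, nor whether high-gap or low-gap dimensions are retained.

```python
from typing import List, Sequence

Frames = Sequence[Sequence[float]]  # one embedding per frame: (time, dim)


def time_average(frames: Frames) -> List[float]:
    """Average a (time, dim) embedding sequence over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    sums = [0.0] * dim
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("inconsistent frame dimension")
        for i, value in enumerate(frame):
            sums[i] += value
    return [s / len(frames) for s in sums]


def select_dimensions(speech: Frames, singing: Frames, keep: int) -> List[int]:
    """Pick the `keep` dimensions with the largest speech/singing gap.

    Hypothetical criterion: absolute difference of time-averaged values
    for one parallel pair (same speaker, same content).
    """
    speech_mean = time_average(speech)
    singing_mean = time_average(singing)
    if len(speech_mean) != len(singing_mean):
        raise ValueError("speech and singing embeddings differ in dimension")
    gaps = [abs(a - b) for a, b in zip(speech_mean, singing_mean)]
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return sorted(ranked[:keep])  # ascending index order for stable slicing


def reduce_frames(frames: Frames, indices: Sequence[int]) -> List[List[float]]:
    """Keep only the selected dimensions in every frame."""
    return [[frame[i] for i in indices] for frame in frames]


# Toy 4-dimensional example; real WavLM embeddings in Figure 1 have 768 dims.
speech = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
singing = [[0.0, 5.0, 2.5, 0.0], [0.0, 5.0, 2.5, 0.0], [0.0, 5.0, 2.5, 0.0]]
selected = select_dimensions(speech, singing, keep=2)
reduced = reduce_frames(speech, selected)
```

In practice the gaps would presumably be aggregated over many parallel pairs rather than a single one, and the reduced frames would then be passed to the conditioning module in place of the full embedding.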

Paper Structure

This paper contains 17 sections, 2 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Overlapped WavLM embeddings (768 dim.), each one averaged on the time dimension, for two parallel audios of the same speaker and content. Blue is for normal speech and orange is for singing.
  • Figure 2: Architecture of Karaoker-SSL