Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations

Panos Kakoulidis; Nikolaos Ellinas; Georgios Vamvoukakis; Myrsini Christidou; Alexandra Vioni; Georgia Maniati; Junkwang Oh; Gunu Jho; Inchul Hwang; Pirros Tsiakoulis; Aimilios Chalamandaris

TL;DR

The paper addresses low-resource singing voice synthesis (SVS) with Karaoker-SSL, a non-attentive acoustic model that is trained only on text and speech data and conditioned on reduced self-supervised speech (SSL) representations. Pitch prediction serves as an auxiliary multi-task objective, and a U-Net discriminator trained under a Diffusion GAN scheme refines the mel spectrograms; no singing data or lyrics timestamps are used during training. A key contribution is the selection of task-relevant SSL dimensions using parallel speech-singing data, which yields an unsupervised conditioning signal that preserves style while reducing linguistic content. The approach is reported to achieve competitive subjective and objective performance for cross-domain SVS without requiring domain-specific singing datasets.

Abstract

In this paper, we propose a singing voice synthesis model, Karaoker-SSL, that is trained only on text and speech data, like a typical multi-speaker acoustic model. It is a low-resource pipeline that does not use any singing data end-to-end, since its vocoder is also trained on speech data. Karaoker-SSL is conditioned on self-supervised speech representations in an unsupervised manner. We preprocess these representations by selecting only a subset of their task-correlated dimensions. The conditioning module is indirectly guided to capture style information during training through multi-task learning. This is achieved with a Conformer-based module, which predicts the pitch from the acoustic model's output. Thus, Karaoker-SSL allows singing voice synthesis without reliance on hand-crafted and domain-specific features. It also requires no text alignments or lyrics timestamps. To refine the voice quality, we employ a U-Net discriminator that is conditioned on the target speaker and follows a Diffusion GAN training scheme.
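
The abstract does not say how the task-correlated dimensions are chosen, so the following is only a minimal sketch of one plausible reading: rank each dimension of a frame-level SSL feature matrix by its absolute Pearson correlation with a frame-level target signal (a pitch contour is assumed here) and keep the top `k`. The function name `select_correlated_dims`, the choice of Pearson correlation, the pitch target, and the synthetic data are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def select_correlated_dims(features, target, k):
    """Return sorted indices of the k dimensions with the highest |Pearson r|.

    features: array of shape (frames, dims), e.g. SSL representations (assumed).
    target:   array of shape (frames,), e.g. a pitch contour (assumed).
    """
    f = features - features.mean(axis=0)   # center each dimension
    t = target - target.mean()             # center the target
    denom = np.sqrt((f ** 2).sum(axis=0) * (t ** 2).sum())
    denom = np.where(denom == 0, np.inf, denom)  # constant inputs get r = 0
    corr = (f * t[:, None]).sum(axis=0) / denom
    top = np.argsort(-np.abs(corr))[:k]    # strongest correlations first
    return np.sort(top)

# Synthetic stand-in data: 16 feature dimensions, two of which track the target.
rng = np.random.default_rng(0)
frames, dims = 500, 16
ssl_feats = rng.normal(size=(frames, dims))
pitch = rng.normal(size=frames)
ssl_feats[:, 3] += 2.0 * pitch    # positively correlated dimension
ssl_feats[:, 11] -= 2.0 * pitch   # negatively correlated dimension

keep = select_correlated_dims(ssl_feats, pitch, k=2)
reduced = ssl_feats[:, keep]      # reduced representation used for conditioning
```

Ranking by absolute value keeps negatively correlated dimensions as well, since they carry the same information with opposite sign. In practice the value of `k` and the target signal would have to come from the paper's own analysis of parallel speech-singing data.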
