Singer Identity Representation Learning using Self-Supervised Techniques

Bernardo Torres; Stefan Lattner; Gaël Richard

TL;DR

This work addresses the challenge of learning robust singer identity representations with self-supervised learning (SSL) applied directly to singing voice data, rather than relying on labeled speech data. It compares four SSL frameworks (SimCLR, Uniformity-Alignment, VICReg, and BYOL) trained on a large corpus of isolated vocal tracks at 44.1 kHz, with data augmentations that encourage invariance to pitch and content. The authors evaluate on singer similarity and singer identification across multiple datasets, including out-of-domain scenarios, and show that SSL models can outperform public speech baselines while using far fewer parameters, with BYOL offering the best generalization. The findings highlight the value of high-frequency information for singing tasks and provide a reusable SSL framework and codebase to advance singing voice representation learning and related singing voice synthesis (SVS) and singing voice conversion (SVC) applications.

Abstract

Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we suggest a framework for training singer identity encoders to extract representations suitable for various singing-related tasks, such as singing voice similarity and synthesis. We explore different self-supervised learning techniques on a large collection of isolated vocal tracks and apply data augmentations during training to ensure that the representations are invariant to pitch and content variations. We evaluate the quality of the resulting representations on singer similarity and identification tasks across multiple datasets, with a particular emphasis on out-of-domain generalization. Our proposed framework produces high-quality embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice while operating at 44.1 kHz. We release our code and trained models to facilitate further research on singing voice and related areas.
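
To make the contrastive idea concrete, the following is a minimal NumPy sketch of an NT-Xent objective in the style of SimCLR, one of the SSL families explored here. It is an illustration under stated assumptions, not the released implementation: the function names, the temperature of 0.2, and the pairing scheme (row i of each batch holds an embedding of a different augmented view of the same singer) are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def nt_xent_loss(z_a, z_b, temperature=0.2):
    """NT-Xent loss for paired embeddings of shape (n, d).

    Assumption: z_a[i] and z_b[i] embed two views of the same singer
    (the positive pair); every other row in the batch is a negative.
    The temperature value is illustrative, not taken from the paper.
    """
    n = z_a.shape[0]
    z = l2_normalize(np.concatenate([z_a, z_b], axis=0))  # (2n, d)
    sim = (z @ z.T) / temperature                          # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                         # a row is never its own candidate
    # Row i (first half) is paired with row i + n, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))
    aligned = nt_xent_loss(z, z + 0.01 * rng.normal(size=z.shape))
    shuffled = nt_xent_loss(z, np.roll(z, 1, axis=0))
    print(f"aligned pairs: {aligned:.3f}, mismatched pairs: {shuffled:.3f}")
```

The loss is low when the two views of each singer land close together and far from the other singers in the batch, which is the behavior the pitch and content augmentations are meant to encourage.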
