From Real to Cloned Singer Identification

Dorian Desblancs, Gabriel Meseguer-Brocal, Romain Hennequin, Manuel Moussallam

TL;DR

This work tackles the problem of identifying the original singer behind AI-generated voice clones in music. It introduces three embedding models trained with singer-level contrastive learning on mixtures, vocal stems, or both, and evaluates them on open (FMA, MTG) and large closed datasets, plus cloned-voice tracks. Real-singer identification is strong across models, but performance drops sharply for cloned voices, especially for mixture-based inputs, highlighting biases toward instrumental context and the need for robust, cloning-aware systems. The study provides open-source singer identification splits to benchmark progress and discusses future directions, including few-shot learning on cloned voices, to better combat voice deepfakes in music. The findings have practical implications for policy and platform decisions.

Abstract

Cloned voices of popular singers sound increasingly realistic and have gained popularity over the past few years. However, they pose a threat to the industry due to personality rights concerns. As such, methods to identify the original singer in synthetic voices are needed. In this paper, we investigate how singer identification methods could be used for such a task. We present three embedding models that are trained using a singer-level contrastive learning scheme, where positive pairs consist of segments with vocals from the same singers. These segments can be mixtures for the first model, vocals for the second, and both for the third. We demonstrate that all three models are highly capable of identifying real singers. However, their performance deteriorates when classifying cloned versions of singers in our evaluation set. This is especially true for models that use mixtures as input. These findings highlight the need to understand the biases that exist within singer identification systems, and how they can influence the identification of voice deepfakes in music.
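
To make the singer-level contrastive scheme more concrete, here is a minimal sketch of an InfoNCE-style loss over positive pairs. The abstract does not state which loss the authors use, so the loss form, the `temperature` value, and the function names are all assumptions made for illustration. Each row of `anchors` and the matching row of `positives` are taken to be embeddings of two vocal segments from the same singer; the other rows in the batch act as negatives.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (an assumed choice, not confirmed by the source).

    anchors[i] and positives[i] are embeddings of two segments from the
    same singer. Every positives[j] with j != i is treated as a negative,
    which assumes each singer appears only once per batch.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Temperature-scaled cosine similarity between anchor i and every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the true positive at index i.
        total += log_denom - logits[i]
    return total / len(a)
```

In this sketch, the loss is small when each anchor is closest to its own singer's segment and grows when it is closer to another singer's segment. If a batch could contain several segments from one singer, those rows would need to be masked out of the negatives.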

Paper Structure (12 sections, 1 equation, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Singer identification results obtained on the closed (left) and cloned (right) datasets. We display results over 10 runs for 100 to 1000 singer classes. For each number of classes, we display the top-1 and top-5 accuracies for each run (pale markers), and the mean across all runs (prominent markers). On the closed dataset, we randomly sample a subset of the 7500 singers on every run and display results on their test tracks. For the cloned dataset, we train our models to classify the 67 cloned singers and other randomly selected singers. We then display the results on the 377 spoofed tracks.
  • Figure 2: Vocal model performance by genre when trying to classify 1500 singers. The macro genre tags are gathered from Deezer and are unique for each test track. The orange dots show the mean top-5 accuracy for each run. The boxes display the median and interquartile range (IQR) between runs. The whiskers extend to points that lie within 1.5 IQRs of the lower and upper quartiles. Finally, outlier runs have circles drawn around them. Genres containing fewer than 100 test tracks are omitted from this plot.
  • Figure 3: Vocal model performance over 500, 1000, and 1500-singer identification. We report results from each run in buckets defined by the number of training tracks per singer used to train our classifiers. The first bucket shows the top-1 accuracies observed for singers with only 5 to 9 training tracks, the second for singers with 10 to 19 training tracks, and the last for singers with 20 or more training tracks. We report results using violin plots, where, for each bucket, the inner figure is a box plot similar to that in Figure 2 and the outer figure is a kernel density estimation of the data.
  • Figure 4: Mean all-pairs cosine similarity between each of the closed set singers' test track embeddings and: in purple (test/other), the embeddings from a random track from another singer; in red (test/val), their validation track embeddings; in green (test/vocal), their test track's vocal stem embeddings; in orange (test/instru), their test track's instrumental stem embeddings; in blue (test/test), the other embeddings from the same track. All embeddings are generated on segments with vocals.
  • Figure :
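
The captions above rely on two quantities: top-k accuracy (Figures 1 to 3) and mean all-pairs cosine similarity between sets of embeddings (Figure 4). The following is a minimal sketch of how these could be computed. The function names and the input layout (one list of class scores per track, one embedding vector per segment) are assumptions for illustration, not the authors' evaluation code.

```python
import math


def top_k_accuracy(scores, labels, k):
    """Fraction of tracks whose true singer is among the k highest-scoring classes.

    scores[i] is a list of per-class scores for track i (hypothetical layout);
    labels[i] is the index of the true singer.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mean_all_pairs_cosine(set_a, set_b):
    """Average cosine similarity over every (a, b) pair from the two sets."""
    sims = [cosine(u, v) for u in set_a for v in set_b]
    return sum(sims) / len(sims)
```

For example, `mean_all_pairs_cosine(test_embeddings, val_embeddings)` would correspond to a test/val comparison for one singer, with both argument names being placeholders. When both sets come from the same track, this sketch does not exclude a segment's pairing with itself, which would need separate handling.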