Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

Anmol Guragain; Tianchi Liu; Zihan Pan; Hardik B. Sailor; Qiongqiong Wang

TL;DR

This work describes a leading system that achieves a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024. It also introduces a novel Squeeze-and-Excitation Aggregation (SEA) method that efficiently and effectively integrates representation features from speech foundation models and outperforms the authors' other individual systems.

Abstract

This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices and is attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore ensemble methods that use speech foundation models to build robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models and surpasses the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The code is available at https://github.com/Anmol2059/SVDD2024.

Paper Structure (22 sections, 2 figures, 3 tables)

This paper contains 22 sections, 2 figures, 3 tables.

Introduction
Methodology
Data Augmentation
Parallel Noise Addition
Sequential Noise Addition
Individual Models Description
Frontend
Layer Aggregation Strategy
Backend
Classifier
Model Ensembling
Experimental Setup
Data Set
Training Strategy
Results
...and 7 more sections

Figures (2)

Figure 1: The system architecture of a speech foundation model-based singing voice deepfake detection system. The top-left corner shows the legend. The bottom-left section illustrates the SSL-based front-end, whose output is a set of representation features of size $N \times F \times T$, where $N$ is the number of layers in the SSL encoder, $F$ is the dimension of the representation features, and $T$ is the number of frames. In this figure, $N = 6$ is used as an example. The right side details the layer aggregation process, including the three aggregation strategies used in this work: (a) Weighted Sum, (b) Attentive Merging (AttM), and (c) the proposed SE Aggregation (SEA).
Figure 2: The radar chart comparing the performance of our best individual model (M9) and the best ensemble system (E5) in terms of EER on sub-trials of the CtrSVDD evaluation set.
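
The layer aggregation step described in the Figure 1 caption can be illustrated with a rough NumPy sketch. It assumes the front-end output is an array of shape (N, F, T). The `weighted_sum` function follows the usual softmax-weighted layer sum. The `se_aggregate` function is only a generic Squeeze-and-Excitation style gate over the layer axis: its bottleneck size, the global mean used for the squeeze, and the final sum over gated layers are assumptions made for illustration and may differ from the SEA module in the paper. All function and parameter names are hypothetical.

```python
import numpy as np


def weighted_sum(feats, logits):
    """Collapse (N, F, T) layer features to (F, T) with softmax layer weights.

    feats  : array of shape (N, F, T), one slice per SSL encoder layer
    logits : array of shape (N,), one scalar per layer (learned in practice)
    """
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w = w / w.sum()
    return np.tensordot(w, feats, axes=(0, 0))


def se_aggregate(feats, w1, w2):
    """SE-style gating over the layer axis (assumed form, not the paper's code).

    w1 : array of shape (H, N), bottleneck projection (H is hypothetical)
    w2 : array of shape (N, H), projection back to one gate per layer
    """
    z = feats.mean(axis=(1, 2))            # squeeze: one statistic per layer
    h = np.maximum(w1 @ z, 0.0)            # excitation, ReLU bottleneck
    g = 1.0 / (1.0 + np.exp(-(w2 @ h)))    # sigmoid gate in (0, 1) per layer
    # Assumption: gated layers are summed; a real model might concatenate.
    return np.tensordot(g, feats, axes=(0, 0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((6, 4, 5))  # N = 6 as in the Figure 1 example
    print(weighted_sum(feats, np.zeros(6)).shape)
    print(se_aggregate(feats, rng.standard_normal((3, 6)),
                       rng.standard_normal((6, 3))).shape)
```

With all-zero logits, the weighted sum reduces to a plain average over layers, which is a convenient sanity check. The SE-style gate differs in that its per-layer weights depend on the input rather than being fixed after training.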