SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

TL;DR

The SVDD Challenge 2024 introduces the first dedicated evaluation plan for singing-voice deepfake detection (SVDD), addressing the distinct challenges of the task with two realistic tracks: controlled (CtrSVDD) and in-the-wild (WildSVDD). It leverages curated datasets, including a licensing-aware CtrSVDD dataset and a substantially enlarged WildSVDD corpus, and uses the equal error rate (EER) as the primary performance metric to assess robustness against unseen generators. Baseline systems built on the AASIST framework with both LFCC and raw waveform front-ends reveal generalization gaps to novel deepfake methods, motivating further research into robust SVDD models. The plan also outlines data usage rules, submission pipelines via CodaBench, and a pathway for disseminating results and system descriptions at SLT 2024, facilitating shared progress and reproducibility in singing-voice deepfake detection.

Abstract

The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the "SVDD Challenge," the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024).

Paper Structure (16 sections, 3 figures, 1 table)

This paper contains 16 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Illustration of Equal Error Rate (EER).
  • Figure 2: Baseline systems architecture. We adjust the linear layer before the GAT backbone to adapt for different front-end dimensionalities. More details of HS-GAL are available in Jung et al. (2021) [Jung2021AASIST].
  • Figure 3: Validation EER per training epoch. The lowest EER, indicating the checkpoint selected for evaluation, is marked by a red line for LFCC and a green line for raw waveform. Best viewed in color.
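
Since EER is the primary metric and Figures 1 and 3 both depend on it, a small worked example may help. The following is a minimal sketch, not the challenge's official scoring script. It assumes that a higher score means "more likely bonafide", and the function name `compute_eer` and the toy scores are hypothetical. It sweeps a threshold over all observed scores and reports the point where the false acceptance rate (deepfakes accepted) and the false rejection rate (bonafide rejected) are closest.

```python
import numpy as np


def compute_eer(bonafide_scores, deepfake_scores):
    """Return (eer, threshold) for two sets of detection scores.

    Assumption: higher score = more likely bonafide.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    deepfake = np.asarray(deepfake_scores, dtype=float)

    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([bonafide, deepfake]))

    # False rejection rate: fraction of bonafide clips scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: fraction of deepfake clips scored at or above it.
    far = np.array([(deepfake >= t).mean() for t in thresholds])

    # EER is taken where the two error rates are closest; average them
    # because on finite data they rarely coincide exactly.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


# Toy scores (hypothetical), four clips per class.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

Under the same assumptions, checkpoint selection as described in the Figure 3 caption amounts to computing this value on the validation set after each epoch and keeping the epoch with the lowest EER, for example with `int(np.argmin(val_eers))` over a hypothetical list `val_eers`.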