SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe
TL;DR
SpoofCeleb addresses the lack of in-the-wild data for speech deepfake detection (SDD) and spoofing-robust automatic speaker verification (SASV). It builds on the VoxCeleb1-derived TITW-Easy pipeline to train 23 TTS-based spoofing systems, yielding over 2.5 million samples from 1,251 speakers. The dataset supports concurrent evaluation of SDD and SASV, with carefully designed train/validation/evaluation splits and 23 diverse spoofing attacks spanning acoustic, waveform, and end-to-end (E2E) models. Baseline SDD and SASV systems show that in-the-wild training substantially improves robustness: training on SpoofCeleb yields the best SASV a-DCF and more balanced error rates, although a persistent gap remains between real-world spoofing threats and current defenses. The work demonstrates the value of large-scale, real-world data for advancing spoofing-resilient speaker verification and outlines directions for improving TTS training on challenging in-the-wild data.
Abstract
This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV). It uses source data recorded in real-world conditions, together with spoofing attacks generated by Text-To-Speech (TTS) systems trained on the same real-world data. Training robust recognition systems requires speech data recorded in varied acoustic environments with different levels of noise. However, current datasets typically consist of clean, high-quality recordings (bona fide data), because TTS training generally requires studio-quality or well-recorded read speech. Current SDD datasets are also of limited use for training SASV models because they lack sufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline that we developed to process the VoxCeleb1 dataset into a form suitable for TTS training. We then train 23 contemporary TTS systems on the processed data. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We present baseline results for both the SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb.
