Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge
Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj
TL;DR
This work tackles audio deepfake detection that stays robust under unseen attacks, aiming to improve generalization at low training cost. It introduces SLIM, a two-stage framework: Stage 1 uses self-supervised contrastive learning to capture style-linguistics dependencies from real speech, and Stage 2 fuses these dependency embeddings with raw SSL features for supervised discrimination. The method achieves a minDCF of $0.1499$ and an EER of $5.56\%$ on ASVspoof5 Track 1, and generalizes to out-of-domain data, with EERs of $7.4\%$ on ASV2019 LA and $10.8\%$ on In-the-Wild (ITW). The training footprint remains relatively small (roughly $7$ million trainable parameters and under $15$ hours of training). Ablation studies suggest that focal loss and codec-specific augmentations could yield further gains. The approach also reduces the generalization gap compared to baselines. Overall, SLIM provides a practical, generalizable, and computationally efficient avenue for deepfake speech detection.
Abstract
Audio deepfake detection is crucial to combat the malicious use of AI-synthesized speech. Among the many efforts undertaken by the community, the ASVspoof challenge has become one of the benchmarks for evaluating the generalizability and robustness of detection models. In this paper, we present Reality Defender's submission to the ASVspoof5 challenge, highlighting a novel pretraining strategy that significantly improves generalizability while maintaining low computational cost during training. Our system, SLIM, learns style-linguistics dependency embeddings from various types of bonafide speech using self-supervised contrastive learning. The learned embeddings help to discriminate spoof from bonafide speech by focusing on the relationship between the style and linguistic aspects of speech. We evaluated our system on ASVspoof5, ASV2019, and In-the-Wild. Our submission achieved a minDCF of 0.1499 and an EER of 5.5% on ASVspoof5 Track 1, and EERs of 7.4% and 10.8% on ASV2019 and In-the-Wild, respectively.
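For readers unfamiliar with the equal error rate (EER) quoted above, the following is a minimal sketch of how it can be computed from detection scores. It is not the official ASVspoof scoring tool. The function name `compute_eer`, the convention that higher scores mean "more likely bonafide", and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector whose higher scores mean 'bonafide'.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the false acceptance rate can reach zero.
    thresholds = np.unique(np.concatenate([bonafide_scores, spoof_scores]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A trial is accepted as bonafide when score >= threshold.
    # FRR: fraction of bonafide trials rejected.
    # FAR: fraction of spoof trials accepted.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])

    # The EER is the operating point where FAR and FRR are (closest to) equal.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return eer, thresholds[idx]


if __name__ == "__main__":
    # Toy scores (made up): one bonafide and one spoof trial overlap.
    bonafide = [0.9, 0.8, 0.4, 0.7]
    spoof = [0.1, 0.2, 0.5, 0.3]
    eer, thr = compute_eer(bonafide, spoof)
    print(f"EER = {eer:.2%} at threshold {thr:.2f}")
```

A lower EER means better separation between bonafide and spoof scores. minDCF additionally weights the two error types by assumed costs and priors, which this sketch does not cover.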
