Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge

Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj

TL;DR

This work tackles robust audio deepfake detection under unseen attacks, aiming to improve generalization with a low-cost training regime. It introduces SLIM, a two-stage framework: Stage 1 uses self-supervised contrastive learning to capture style-linguistics dependencies from real speech, and Stage 2 fuses these dependency embeddings with raw SSL features for supervised discrimination. The method achieves a minDCF of $0.1499$ and an EER of $5.56\%$ on ASVspoof5 Track 1, and generalizes to out-of-domain data (ASV2019 LA and In-the-wild, ITW) with EERs of $7.4\%$ and $10.8\%$, respectively, while keeping a relatively small training footprint (about $7$ million trainable parameters and under $15$ hours). Ablation studies suggest that further gains are possible with focal loss and codec-specific augmentations, and the approach reduces the generalization gap compared to baselines. Overall, SLIM provides a practical, generalizable, and computation-efficient avenue for deepfake speech detection.

Abstract

Audio deepfake detection is crucial to combat the malicious use of AI-synthesized speech. Among many efforts undertaken by the community, the ASVspoof challenge has become one of the benchmarks to evaluate the generalizability and robustness of detection models. In this paper, we present Reality Defender's submission to the ASVspoof5 challenge, highlighting a novel pretraining strategy which significantly improves generalizability while maintaining low computational cost during training. Our system SLIM learns the style-linguistics dependency embeddings from various types of bonafide speech using self-supervised contrastive learning. The learned embeddings help to discriminate spoof from bonafide speech by focusing on the relationship between the style and linguistics aspects. We evaluated our system on ASVspoof5, ASV2019, and In-the-wild. Our submission achieved minDCF of 0.1499 and EER of 5.5% on ASVspoof5 Track 1, and EER of 7.4% and 10.8% on ASV2019 and In-the-wild respectively.

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Two-stage training framework of SLIM. Stage 1 extracts style and linguistics representations from frozen SSL encoders, projects them into a lower-dimensional space, and minimizes both the distance between the projected representations and the intra-subspace redundancy. In Stage 2, the Stage 1 embeddings, the style embeddings (output of the style encoder), and the linguistics embeddings (output of the linguistics encoder) are concatenated to train a classifier with supervision. The architectures of the projector network and classifier are given in the Appendix.
  • Figure 2: Breakdown of system performance (minDCF) on ASV5 eval dataset.
  • Figure 3: Distribution of NISQA MOS for ASV5 train, dev, and 40k samples from eval.
  • Figure 4: Architecture of the projector network with input and output dimensions. Input $\mathbf{X_{L,F,T}}$ represents the original subspace representation encoded by the SSL frontend, where $L$ denotes the transformer layer index, $F$ denotes the feature size, and $T$ denotes the number of time steps.
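
The Stage 1 objective described in the Figure 1 caption (pull the projected style and linguistics representations together while reducing redundancy within each subspace) can be illustrated with a minimal NumPy sketch. This is not the paper's exact loss: the function names `standardize` and `stage1_loss`, the squared-distance term, the off-diagonal correlation penalty, and the `redundancy_weight` parameter are all assumptions chosen for illustration, and the inputs are assumed to be already-projected embeddings of shape (N, D) with D at least 2.

```python
import numpy as np


def standardize(z, eps=1e-8):
    """Zero-mean, unit-variance scaling of each feature across the batch axis."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)


def stage1_loss(style, ling, redundancy_weight=1.0):
    """Illustrative Stage 1 objective (an assumption, not the paper's formula).

    style, ling: arrays of shape (N, D) holding projected style and
    linguistics embeddings for the same N bonafide utterances.
    Returns (total, distance, redundancy).
    """
    n, d = style.shape
    s = standardize(style)
    l = standardize(ling)

    # Cross-subspace term: mean squared distance between paired embeddings,
    # normalized by the feature dimension.
    distance = np.mean(np.sum((s - l) ** 2, axis=1)) / d

    def off_diagonal_energy(z):
        # Feature correlation matrix within one subspace; the off-diagonal
        # entries measure how much different dimensions duplicate each other.
        corr = (z.T @ z) / n
        off = corr - np.diag(np.diag(corr))
        return np.sum(off ** 2) / (d * (d - 1))

    # Intra-subspace term: summed over the style and linguistics subspaces.
    redundancy = off_diagonal_energy(s) + off_diagonal_energy(l)

    total = distance + redundancy_weight * redundancy
    return total, distance, redundancy
```

In this sketch the distance term is zero when the two projected views coincide, and the redundancy term grows when feature dimensions within a subspace are strongly correlated, which matches the two goals named in the caption.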