Unmasking Deepfakes: Leveraging Augmentations and Features Variability for Deepfake Speech Detection

Inbal Rimon, Oren Gal, Haim Permuter

TL;DR

The paper addresses deepfake speech detection under evolving generation techniques by proposing a hybrid training framework that fuses self-supervised pretraining with supervised end-to-end learning. It introduces two-stage masking (MaskedSpec and MaskedFeature) and compression-aware SSL to increase robustness, along with a hybrid pipeline using Wav2Vec2 and ResNet34. Empirical results show state-of-the-art performance on ASVspoof5 Track 1 and strong cross-dataset results, with significant gains from fusing models that differ in pretraining and augmentation configuration. The findings highlight the value of augmentation diversity and domain-specific pretraining for robust deepfake detection, while acknowledging that cross-domain generalization remains a challenge. The work suggests future directions toward unified, domain-invariant representations to improve practicality in real-world deployments.

Abstract

Deepfake speech detection presents a growing challenge as generative audio technologies continue to advance. We propose a hybrid training framework that advances detection performance through novel augmentation strategies. First, we introduce a dual-stage masking approach that operates both at the spectrogram level (MaskedSpec) and within the latent feature space (MaskedFeature), providing complementary regularization that improves tolerance to localized distortions and enhances generalization. Second, we introduce a compression-aware strategy during self-supervised pretraining to increase variability in low-resource scenarios while preserving the integrity of learned representations, thereby improving the suitability of pretrained features for deepfake detection. The framework integrates a learnable self-supervised feature extractor with a ResNet classification head in a unified training pipeline, enabling joint adaptation of acoustic representations and discriminative patterns. On the ASVspoof5 Challenge (Track 1), the system achieves state-of-the-art results with an Equal Error Rate (EER) of 4.08% under closed conditions, further reduced to 2.71% through fusion of models with diverse pretrained feature extractors. When trained on ASVspoof2019, our system obtains leading performance on the ASVspoof2019 evaluation set (0.18% EER) and the ASVspoof2021 DF task (2.92% EER).
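The results above are reported as Equal Error Rate (EER), the operating point at which the false acceptance rate (spoofed speech accepted as bona fide) equals the false rejection rate (bona fide speech rejected). The following is a minimal sketch of that metric, not the official ASVspoof scoring code. The function name, the threshold sweep over observed scores, and the convention that a higher score means "more likely bona fide" are assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: a higher score means "more likely bona fide".
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    bonafide = [0.9, 0.8, 0.7, 0.4]
    spoof = [0.6, 0.3, 0.2, 0.1]
    print(f"EER = {100 * equal_error_rate(bonafide, spoof):.2f}%")
```

On real evaluation sets with many trials, scoring tools typically interpolate between thresholds rather than picking the closest observed one, so values from this sketch may differ slightly from official numbers.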

Paper Structure (28 sections, 5 equations, 2 figures, 13 tables)

Figures (2)

  • Figure 1: Overview of the proposed hybrid training framework. In the first stage (top, dark-shaded blocks), raw audio is used to pretrain the feature extractor using a self-supervised objective. In the second stage (bottom, light-shaded blocks), the model is trained end-to-end for the task of deepfake speech detection. Augmentations are applied at multiple points in the pipeline to boost tolerance to variability and to enrich the training signal under low-resource conditions. The feature extractor is initialized with weights from the pretraining phase, as indicated by the downward arrow.
  • Figure 2: Visualization of the different mask types used in our experiments: Squares, Bands, Singles, and Gauss. White regions indicate the masking value ($\mu_{\text{stft}}$), while black regions represent the original, unmasked values.
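The masking idea in Figure 2 can be illustrated with a short sketch: selected time-frequency regions of a spectrogram are overwritten with a single fill value. This is not the authors' implementation. The function names, the region sizes and counts, the reading of "Bands" as full-width frequency bands, and the use of the per-spectrogram mean as a stand-in for $\mu_{\text{stft}}$ are all assumptions. Only two of the four mask types are sketched, since the exact geometry of Singles and Gauss is not given here.

```python
import random


def mean_value(spec):
    """Mean over all time-frequency bins (assumed stand-in for mu_stft)."""
    total = sum(sum(row) for row in spec)
    return total / (len(spec) * len(spec[0]))


def mask_squares(spec, n_squares=3, side=2, rng=random):
    """Overwrite randomly placed side x side squares with the mean value."""
    n_freq, n_time = len(spec), len(spec[0])
    fill = mean_value(spec)
    out = [row[:] for row in spec]  # copy so the input stays unchanged
    for _ in range(n_squares):
        f0 = rng.randint(0, n_freq - side)
        t0 = rng.randint(0, n_time - side)
        for f in range(f0, f0 + side):
            for t in range(t0, t0 + side):
                out[f][t] = fill
    return out


def mask_bands(spec, n_bands=2, max_width=3, rng=random):
    """Overwrite whole frequency bands (all time frames) with the mean value."""
    n_freq = len(spec)
    fill = mean_value(spec)
    out = [row[:] for row in spec]
    for _ in range(n_bands):
        width = rng.randint(1, max_width)
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [fill] * len(out[f])
    return out


if __name__ == "__main__":
    # Toy 6 x 8 "spectrogram" (frequency x time), invented for illustration.
    spec = [[float(10 * f + t) for t in range(8)] for f in range(6)]
    rng = random.Random(0)  # seeded so the masks are reproducible
    for row in mask_squares(spec, rng=rng):
        print(" ".join(f"{v:5.1f}" for v in row))
```

The same region-selection logic could in principle be applied to latent features instead of spectrogram bins, which is the distinction the paper draws between MaskedSpec and MaskedFeature.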