Targeted Augmented Data for Audio Deepfake Detection

Marcella Astrid, Enjie Ghorbel, Djamila Aouada

TL;DR

The paper tackles the robustness gap in audio deepfake detectors, which tend to overfit to the manipulations seen during training. It introduces a gradient-based, boundary-targeted augmentation that perturbs real inputs toward the model's decision boundary using the perturbation $\mathbf{p} = - \epsilon \cdot \text{sign}(\nabla_{\mathbf{x}} \mathcal{L}(\hat{\mathbf{y}}^o, \tilde{\mathbf{y}}))$, where $\epsilon \in [\epsilon_{\text{min}}, \epsilon_{\text{max}}]$ and $\tilde{\mathbf{y}} = [0.5,0.5]$ is the ambiguous target prediction. The augmented samples are labeled as fake and mixed into training with probability $p$ (a scalar, distinct from the perturbation $\mathbf{p}$). This architecture-agnostic approach was evaluated on two detectors, AASIST and RawNet2, using the ASVspoof 2019 LA dataset, and improved generalization, as shown by lower min t-DCF and EER on unseen attacks. Ablation results indicate that targeting ambiguous predictions near the decision boundary provides the strongest gains compared with untargeted augmentation or augmentation targeting confident fake predictions. The work demonstrates a practical, data-centric way to combat overfitting in audio deepfake detection and motivates exploring additional adversarial augmentation strategies.
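
The perturbation rule in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation: the toy linear softmax classifier stands in for AASIST or RawNet2 so that the input gradient has a closed form, and the class indices `REAL`/`FAKE`, the helper names, the uniform sampling of $\epsilon$, and the choice to swap a real sample for its pseudo-fake are all assumptions made for illustration.

```python
import math
import random

REAL, FAKE = 0, 1  # hypothetical class indices, not taken from the paper


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(x, W, b):
    """Toy linear 2-class detector (a stand-in model): softmax(W x + b)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def input_gradient(x, W, b, target):
    """Gradient of cross-entropy(softmax(W x + b), target) with respect to x.

    For softmax with cross-entropy and a target summing to 1,
    dL/dlogits = y_hat - target, hence dL/dx = W^T (y_hat - target).
    """
    y_hat = predict(x, W, b)
    d = [yh - t for yh, t in zip(y_hat, target)]
    return [sum(W[k][i] * d[k] for k in range(len(W))) for i in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def pseudo_fake(x, W, b, eps, target=(0.5, 0.5)):
    """Return x + p with p = -eps * sign(grad_x L(y_hat, target))."""
    g = input_gradient(x, W, b, target)
    return [xi - eps * sign(gi) for xi, gi in zip(x, g)]


def maybe_augment(x, label, W, b, p, eps_min, eps_max, rng=random):
    """With probability p, turn a real sample into a pseudo-fake labeled FAKE."""
    if label == REAL and rng.random() < p:
        eps = rng.uniform(eps_min, eps_max)  # assumed: uniform over the range
        return pseudo_fake(x, W, b, eps), FAKE
    return x, label


# Illustrative values only.
W = [[1.0, -1.0, 0.5, -0.5], [-1.0, 1.0, -0.5, 0.5]]
b = [0.0, 0.0]
x = [1.0, -1.0, 1.0, -1.0]  # confidently classified as REAL by the toy model
x_aug = pseudo_fake(x, W, b, eps=0.1)
print(predict(x, W, b)[REAL], predict(x_aug, W, b)[REAL])
```

In this toy setting the perturbed sample's real-class probability moves toward 0.5, which is the intended effect of targeting $\tilde{\mathbf{y}} = [0.5,0.5]$. With a deep detector the gradient would come from automatic differentiation rather than the closed form used here.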

Abstract

The availability of highly convincing audio deepfake generators highlights the need for designing robust audio deepfake detectors. Existing works often rely solely on real and fake data available in the training set, which may lead to overfitting, thereby reducing the robustness to unseen manipulations. To enhance the generalization capabilities of audio deepfake detectors, we propose a novel augmentation method for generating audio pseudo-fakes targeting the decision boundary of the model. Inspired by adversarial attacks, we perturb original real data to synthesize pseudo-fakes with ambiguous prediction probabilities. Comprehensive experiments on two well-known architectures demonstrate that the proposed augmentation contributes to improving the generalization capabilities of these architectures.

Paper Structure

This paper contains 10 sections, 4 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed augmentation method for increasing the diversity of fake data. Both the original fake data and the augmented data are labeled as fake. The combined dataset is utilized during training. The x-axis denotes time, while the y-axis represents the audio signal magnitude.
  • Figure 2: Illustration of two different augmentation strategies: (a) augmentation of fake data in the neighborhood of the decision boundary (between real and fake data); and (b) augmentation in the neighborhood of fake data without considering the decision boundary. Compared with (b), the strategy employed in (a) can better enhance the generalization to unseen fakes.
  • Figure 3: Visualization of an audio sample: (a) without augmentation, (b) with augmentation targeting ambiguous predictions (ours), (c) with augmentation targeting fake predictions, and (d) with untargeted augmentation. Perturbations of approximately similar magnitude are selected for visualization purposes: $\epsilon=0.1$ for (b) and (c), and $\sigma=0.05$ for (d).
  • Figure 4: The EER (%) under different settings. The blue dotted line represents the baseline, i.e., the model trained without augmentation, while the red solid line depicts the model trained with our proposed augmentation technique. Various ranges of hyperparameter values are explored: (a) the probability of augmented data $p$ on RawNet2 (with fixed $\epsilon_{\text{min}}=0.01$ and $\epsilon_{\text{max}}=0.5$), (b) the maximum perturbation strength $\epsilon_{\text{max}}$ on RawNet2 (with fixed $p=0.7$ and $\epsilon_{\text{min}}=0.01$), (c) the minimum perturbation strength $\epsilon_{\text{min}}$ on RawNet2 (with fixed $p=0.7$ and $\epsilon_{\text{max}}=0.5$), (d) the probability of augmented data $p$ on AASIST (with fixed $\epsilon_{\text{min}}=0.01$ and $\epsilon_{\text{max}}=0.5$), and (e) the maximum perturbation strength $\epsilon_{\text{max}}$ on AASIST (with fixed $p=0.1$ and $\epsilon_{\text{min}}=0.01$). Lower EER values indicate better performance.
  • Figure 5: Samples of augmented data with varying perturbation magnitude. Augmentation with a higher magnitude results in augmented data that are more distinct from the original sample.
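
Figure 4 reports results as EER, the operating point at which the rate of spoofed samples accepted as real equals the rate of real samples rejected. The sketch below shows one simple way to approximate it from detector scores. It assumes higher scores mean "more likely real", uses a hypothetical function name, and sweeps thresholds exhaustively. It is not the official ASVspoof evaluation code.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a sample is accepted as real when score >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: real samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed samples scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Illustrative scores only, not results from the paper.
print(equal_error_rate([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.6, 0.3]))
```

The exhaustive sweep is quadratic in the number of scores, so a sorted, cumulative-count version would be preferable for full evaluation sets.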