I Can Hear You: Selective Robust Training for Deepfake Audio Detection

Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao

TL;DR

This work establishes the largest public voice dataset to date, DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources, and proposes F-SAT (Frequency-Selective Adversarial Training), a robust training method that focuses on high-frequency components.

Abstract

Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose F-SAT (Frequency-Selective Adversarial Training), a method that focuses on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.
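
The abstract says only that the adversarial training focuses on high-frequency components. As a rough illustration of what restricting a perturbation to a frequency band can look like, the sketch below masks a perturbation in the FFT domain so that only content above a cutoff survives. This is not the authors' implementation: the function names, the 4000 Hz cutoff, the 16 kHz sample rate, and the random perturbation are all assumptions chosen for illustration.

```python
import numpy as np


def high_frequency_mask(n_samples, sample_rate, cutoff_hz):
    """Boolean mask over rFFT bins that keeps frequencies >= cutoff_hz."""
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
    return freqs >= cutoff_hz


def restrict_to_high_frequencies(delta, sample_rate, cutoff_hz):
    """Zero out the low-frequency part of a perturbation `delta`.

    Hypothetical helper: an attack or training loop could call this after
    each perturbation update so the perturbation stays in the chosen band.
    """
    n_samples = delta.shape[-1]
    spectrum = np.fft.rfft(delta)
    spectrum = spectrum * high_frequency_mask(n_samples, sample_rate, cutoff_hz)
    return np.fft.irfft(spectrum, n=n_samples)


# Illustrative values only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
sample_rate = 16000
delta = 0.01 * rng.standard_normal(sample_rate)  # one second of noise
delta_hf = restrict_to_high_frequencies(delta, sample_rate, cutoff_hz=4000.0)
```

In a full adversarial training loop, `delta` would come from a gradient-based attack on the detector rather than from random noise. The masking step shown here is only the frequency-selection part.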

Paper Structure

This paper contains 19 sections, 14 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The distribution of deepfake samples over predicted scores using the state-of-the-art detector [jung2022pushing] trained on the In-the-Wild dataset [muller2022does], with a decision boundary at 0.5. Tests on original, corrupted, attacked, and real-world deepfake audio reveal significant shifts in prediction scores, highlighting that training solely on current public datasets without robust training methods leads to poor performance.
  • Figure 2: We apply a high-pass filter to audio samples to remove low-frequency components. The x-axis represents the center frequency of the filter applied. Notably, there is a marked decline in detection performance for real audio starting at 4000 Hz and for fake audio at 6000 Hz.
  • Figure 3: Performance of the RawNet3 baseline model on various datasets. 'Ours (train, w/o new)' represents our training dataset after removing all high-quality deepfake samples.
  • Figure 4: F-SAT Pipeline
  • Figure 5: Overview of various corruption types and adversarial attack strategies affecting audio robustness. The diagram categorizes different forms of corruptions (e.g., noise, filtering, distortion) and adversarial attacks (e.g., white-box, black-box) based on their methods, objectives, and scope of perturbation. This framework outlines the challenges in ensuring the robustness of audio systems against both environmental corruption and intentional adversarial manipulation.
  • ...and 5 more figures