Measuring the Robustness of Audio Deepfake Detectors

Xiang Li, Pin-Yu Chen, Wenqi Wei

TL;DR

This work systematically evaluates the robustness of 10 audio deepfake detectors against 16 real-world corruptions, spanning noise, modifications, and compression. It demonstrates that foundation models generally outperform traditional methods in robustness, with larger models offering gains that plateau at scale, and shows that data augmentation further enhances resilience to unseen perturbations. Neural codecs emerge as a primary threat to detection reliability, highlighting the need for feature representations that are robust to codec-induced distortions. The political-speech case study confirms the practical value of foundation models for reliable detection in real-world, high-stakes scenarios. Overall, the study advocates corruption-aware evaluation and continued development of robust, efficient detectors for practical deployment.

Abstract

Deepfakes have become a universal and rapidly intensifying concern of generative AI across various media types such as images, audio, and videos. Among these, audio deepfakes have been of particular concern due to the ease of high-quality voice synthesis and distribution via platforms such as social media and robocalls. Consequently, detecting audio deepfakes plays a critical role in combating the growing misuse of AI-synthesized speech. However, real-world scenarios often introduce various audio corruptions, such as noise, modification, and compression, that may significantly impact detection performance. This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions, categorized into noise perturbation, audio modification, and compression. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations. First, our findings show that while most models demonstrate strong robustness to noise, they are notably more vulnerable to modifications and compression, especially when neural codecs are applied. Second, speech foundation models generally outperform traditional models across most scenarios, likely due to their self-supervised learning paradigm and large-scale pre-training. Third, our results show that increasing model size improves robustness, albeit with diminishing returns. Fourth, we demonstrate how targeted data augmentation during training can enhance model resilience to unseen perturbations. A case study on political speech deepfakes highlights the effectiveness of foundation models in achieving high accuracy under real-world conditions. These findings emphasize the importance of developing more robust detection frameworks to ensure reliability in practical deployment settings.

Paper Structure

This paper contains 11 sections, 10 figures.

Figures (10)

  • Figure 1: (a) The evaluation framework encompasses three types of audio corruptions: noise perturbation, modification, and compression, including 16 specific corruption techniques (there are 4 types of neural codecs and 2 types of trans codecs). (b) 10 state-of-the-art detection models, which leverage various types of audio features, such as Mel-spectrograms, Linear Frequency Cepstral Coefficients (LFCC), spectrograms, or raw waveforms, are evaluated. These models employ diverse architectures, including convolutional networks, graph attention modules, and foundation models. (c) The evaluation demonstrates that these detection models exhibit strong robustness to noise perturbation but are significantly more vulnerable to audio modification and compression. Larger model sizes generally improve robustness, and incorporating data augmentation techniques can significantly enhance model performance under common corruptions.
  • Figure 2: Robustness against noise perturbation across varying signal-to-noise ratio (SNR). The green-shaded regions represent SNR levels where audio quality is deemed acceptable with $\text{ViSQOL} \geq 3$. The ViSQOL scores for different corruption types at different severity levels can be found in the appendix on audio quality. As illustrated, detection performance improves with increasing SNR, as less noise preserves higher audio quality. Performance significantly deteriorates at low SNR levels (e.g., 5 dB), emphasizing the challenges posed by heavy noise corruption. Among the models, foundation models such as Wave2Vec2BERT are better suited for deployment in noisy environments, maintaining high AUROC and low EER even in severe noise conditions. In contrast, traditional models like LFCC-LCNN and ResNet_Spec show notable performance drops under noisy conditions.
  • Figure 3: Robustness against various types of modification. The results show that detection performance generally declines as modifications alter critical spectral or temporal features, even when perceptual audio quality remains unaffected. Foundation models, such as Wave2Vec2BERT and HuBERT, demonstrate stronger robustness across most modifications, while traditional models are more vulnerable to changes, particularly in the time and frequency domains. This suggests that current detection methods are limited in handling the wide range of spectral and temporal distortions encountered in real-world use. In addition, most models are robust against replay conditions, although RawNet2 experiences significant performance degradation.
  • Figure 4: Robustness against various types of audio compression. While higher audio quality generally improves detection performance, compression often introduces subtle artifacts that significantly degrade model robustness. Notably, even codecs like Encodec, which maintain high perceptual audio quality at low bandwidth, result in significant performance drops for detection models. This highlights a disconnect between human-perceived quality and model sensitivity to compression artifacts. Although foundation models exhibit better robustness compared to traditional models, their performance still declines under extreme compression conditions (e.g., low MP3 bitrates or narrow Encodec bandwidths). These findings underscore the importance of designing detection models capable of learning robust and semantically meaningful features that are less sensitive to compression distortions, given the widespread use of audio compression in real-world applications.
  • Figure 5: (a) Robustness of different scales of Whisper models under noise perturbations, modifications, and compression. The radial distances represent the average accuracy for each corruption type. Larger models, like Whisper-large and Whisper-medium, consistently achieve higher accuracy and more balanced robustness across all corruption types than smaller models, such as Whisper-tiny and Whisper-base, which are more sensitive to certain corruptions. While increasing model size enhances robustness and generalization by learning more robust audio features, the computational and storage demands of larger models necessitate careful consideration in practical applications to balance performance and resource constraints. (b) Robustness of Wave2Vec2BERT trained with and without data augmentation against common corruptions. While the model without augmentation already shows strong robustness to noise perturbations, incorporating augmentation further improves detection accuracy and reduces performance variance, highlighting the effectiveness of data augmentation in improving the adaptability of audio deepfake detection models for practical deployment scenarios. (c) Detection accuracy on deepfake political speech. Foundation models, such as HuBERT and Wave2Vec2BERT, consistently outperform traditional models.
  • ...and 5 more figures
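
The noise-perturbation setting in Figure 2 (corrupting audio at a chosen SNR, then reporting EER) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `add_noise_at_snr` and `equal_error_rate` are hypothetical, Gaussian noise stands in for whatever noise sources the paper uses, and the convention that label 1 means "fake" and a higher score means "more likely fake" is an assumption made for this example.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the result has the requested SNR in dB.

    Hypothetical helper; assumes both inputs are 1-D, non-silent waveforms.
    """
    clean = np.asarray(clean, dtype=np.float64)
    # Repeat or trim the noise so it matches the length of the clean signal.
    noise = np.resize(np.asarray(noise, dtype=np.float64), clean.shape)
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if signal_power == 0 or noise_power == 0:
        raise ValueError("clean signal and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_signal / P_noise), so solve for the noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise


def equal_error_rate(scores, labels):
    """Approximate EER by scanning every observed score as a threshold.

    Assumed convention: labels are 1 for fake and 0 for real, and a higher
    score means the detector considers the clip more likely to be fake.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        predicted_fake = scores >= threshold
        false_accept = np.mean(predicted_fake[labels == 0])   # real flagged as fake
        false_reject = np.mean(~predicted_fake[labels == 1])  # fake missed
        gap = abs(false_accept - false_reject)
        if gap < best_gap:
            best_gap, eer = gap, (false_accept + false_reject) / 2
    return eer


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 16000, endpoint=False)
    clean = np.sin(2 * np.pi * 440 * t)      # 1 s synthetic tone at 16 kHz
    noise = rng.standard_normal(8000)        # shorter than the signal on purpose
    noisy = add_noise_at_snr(clean, noise, snr_db=5)
    measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
    print(f"measured SNR: {measured:.2f} dB")
```

Sweeping `snr_db` over a range of values and recomputing `equal_error_rate` on a detector's scores at each level would give a curve of the kind Figure 2 describes. A real evaluation would use recorded noise and a proper ROC interpolation for the EER rather than this threshold scan.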