DeePen: Penetration Testing for Audio Deepfake Detection
Nicolas Müller, Piotr Kawa, Adriana Stan, Thien-Phuc Doan, Souhwan Jung, Wei Herng Choong, Philip Sperl, Konstantin Böttinger
TL;DR
DeePen evaluates the robustness of audio deepfake detectors through model-agnostic penetration testing. It applies $K=17$ signal-processing attacks to $N$ samples from each of the two classes in the ASVspoof 2019 and MLAAD datasets, producing $2 \times N \times K$ modified samples that are split $50/50$ into train and test sets. The evaluation shows that both open-source and commercial detectors are vulnerable to simple attacks (e.g., time-stretching, background music), and that adaptive retraining improves resilience but cannot fully neutralize every attack. A greedy defense-selection method identifies a minimal subset of defenses whose performance is comparable to retraining on all defenses, which gives practical guidance for robust defense design and offers insight into the audio features the detectors learn. The authors release the DeePen code to enable ongoing robustness evaluation across detectors and attack scenarios, underscoring the need for continual security assessment in audio deepfake detection.
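The sample count above can be illustrated with a minimal sketch. It assumes that every attack is applied to every sample of both classes and that the result is shuffled before being halved; the function name `build_attack_set`, the toy "attacks", and the class labels are hypothetical stand-ins and are not taken from the released DeePen code.

```python
import random

def build_attack_set(class_a, class_b, attacks, seed=0):
    """Apply every attack to every sample of both classes, then split 50/50.

    Hypothetical helper: the shuffling and the split order are assumptions.
    """
    rng = random.Random(seed)
    modified = []
    for label, samples in (("class_a", class_a), ("class_b", class_b)):
        for sample in samples:
            for name, attack in attacks.items():
                modified.append(
                    {"label": label, "attack": name, "audio": attack(sample)}
                )
    rng.shuffle(modified)
    half = len(modified) // 2
    return modified[:half], modified[half:]

# Toy stand-ins: N short "waveforms" per class and K placeholder attacks
# (simple gain changes, NOT the signal-processing attacks used by DeePen).
N, K = 4, 17
attacks = {
    f"attack_{k}": (lambda x, k=k: [v * (1.0 - 0.01 * k) for v in x])
    for k in range(K)
}
class_a = [[0.1 * i, 0.2] for i in range(N)]
class_b = [[-0.1 * i, 0.3] for i in range(N)]

train, test = build_attack_set(class_a, class_b, attacks)
print(len(train) + len(test))  # equals 2 * N * K
```

With $N=4$ and $K=17$ the two halves together hold $2 \times 4 \times 17 = 136$ modified samples, 68 in each split.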
Abstract
Deepfakes (manipulated or forged audio and video media) pose significant security risks to individuals, organizations, and society at large. To address these risks, machine learning-based classifiers are commonly employed to detect deepfake content. In this paper, we assess the robustness of such classifiers through a systematic penetration testing methodology, which we introduce as DeePen. Our approach operates without prior knowledge of or access to the target deepfake detection models. Instead, it leverages a set of carefully selected signal processing modifications, referred to as attacks, to evaluate model vulnerabilities. Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective. We release all associated code.
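To make the notion of a signal-processing attack concrete, the following is a minimal sketch of echo addition, one of the simple manipulations named above. It mixes a single delayed, attenuated copy of the waveform back into the original. The function name `add_echo` and the parameters `delay_s` and `decay` are illustrative assumptions; the abstract does not state how DeePen implements or parameterizes this attack.

```python
import math

def add_echo(signal, sample_rate, delay_s=0.25, decay=0.5):
    """Mix one delayed, attenuated copy of `signal` into itself.

    Assumed parameters: `delay_s` (echo delay in seconds) and `decay`
    (echo gain). Neither value comes from the paper.
    """
    delay = int(round(delay_s * sample_rate))
    out = [0.0] * (len(signal) + delay)
    for i, v in enumerate(signal):
        out[i] += v                   # direct path
        out[i + delay] += decay * v   # delayed echo
    # Rescale only if the mix would clip outside [-1, 1].
    peak = max((abs(v) for v in out), default=0.0)
    if peak > 1.0:
        out = [v / peak for v in out]
    return out

# Usage on a synthetic one-second 440 Hz tone at 16 kHz.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440.0 * n / sr) for n in range(sr)]
echoed = add_echo(tone, sr)
print(len(tone), len(echoed))
```

Because the attack only needs the waveform, it can be applied without any access to the detector, which matches the black-box setting described in the abstract.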
