Every Breath You Don't Take: Deepfake Speech Detection Using Breath
Seth Layton, Thiago De Andrade, Daniel Olszewski, Kevin Warren, Kevin Butler, Patrick Traynor
TL;DR
This work addresses the threat of deepfake speech with a high-level feature: breath. The authors build a breath detector trained on in-the-wild audio (podcasts and news articles) and show that simple breath-derived features (breaths per minute, breath duration, and spacing between breaths) perfectly separate real from deepfake speech on their in-the-wild test data (1.0 AUPRC, 0.0 EER). On the same samples, the state-of-the-art SSL-wav2vec model fails (0.72 AUPRC, 0.99 EER), which shows that complex deep learning models relying on low-level spectral cues can break down on unseen data and underscores the practical value of high-level prosodic features. The authors release their in-the-wild dataset and code. The findings suggest that breath-aware detectors can serve as robust, generation-agnostic defenses; future work could integrate linguistic context to further reduce false positives as deepfakes evolve.
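To make the three breath-derived features concrete, here is a minimal sketch of how they could be computed once a breath detector has produced breath intervals. The function name `breath_features`, the `(start, end)` interval format in seconds, and the use of simple means are assumptions for illustration, not the authors' released implementation.

```python
from statistics import mean


def breath_features(breaths, total_seconds):
    """Summarize detected breaths for one audio clip.

    breaths: list of (start, end) times in seconds (hypothetical format,
             assumed to be the output of some breath detector).
    total_seconds: duration of the clip in seconds.
    """
    breaths = sorted(breaths)
    # Length of each individual breath.
    durations = [end - start for start, end in breaths]
    # Silence/speech gap between the end of one breath and the start of the next.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(breaths, breaths[1:])]
    return {
        "breaths_per_minute": 60.0 * len(breaths) / total_seconds,
        "mean_duration": mean(durations) if durations else 0.0,
        "mean_spacing": mean(gaps) if gaps else 0.0,
    }


if __name__ == "__main__":
    # Three half-second breaths in a 30-second clip.
    print(breath_features([(2.0, 2.5), (10.0, 10.5), (20.0, 20.5)], 30.0))
    # A clip with no detected breaths yields all-zero features.
    print(breath_features([], 30.0))
```

A clip with no detected breaths at all produces all-zero features, which is the kind of signal a simple threshold or small classifier could pick up; how the paper turns these features into a decision is not specified here.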
Abstract
Deepfake speech represents a real and growing threat to systems and society. Many detectors have been created to aid in defense against speech deepfakes. While these detectors implement myriad methodologies, many rely on low-level fragments of the speech generation process. We hypothesize that breath, a higher-level part of speech, is a key component of natural speech and that its improper generation in deepfake speech is therefore a performant discriminator. To evaluate this, we create a breath detector and apply it to a custom dataset of online news article audio to discriminate between real and deepfake speech. Additionally, we make this custom dataset publicly available to facilitate comparison for future work. Applying our simple breath detector as a deepfake speech discriminator on in-the-wild samples allows for accurate classification (perfect 1.0 AUPRC and 0.0 EER on test data) across 33.6 hours of audio. We compare our model with the state-of-the-art SSL-wav2vec model and show that this complex deep learning model completely fails to classify the same in-the-wild samples (0.72 AUPRC and 0.99 EER).
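The two reported metrics, AUPRC and EER, can be illustrated with a small pure-Python sketch. The conventions here are assumptions for illustration: label 1 means deepfake, a higher score means "more likely deepfake", AUPRC is approximated by average precision with score ties ignored, and EER is taken at the threshold where the false positive and false negative rates are closest. The paper's exact evaluation code may differ.

```python
def equal_error_rate(scores, labels):
    """Approximate EER: the error rate where FPR and FNR are closest.

    Assumes both classes are present in labels.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_gap, best_eer = None, None
    # Try every observed score as a threshold, plus one that rejects everything.
    for t in sorted(set(scores)) + [float("inf")]:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fpr, fnr = fp / neg, fn / pos
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2
    return best_eer


def average_precision(scores, labels):
    """Average precision, a common estimate of the area under the PR curve."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.8, 0.9]
    # Perfectly separated scores, then the same scores with labels inverted.
    for labels in ([0, 0, 1, 1], [1, 1, 0, 0]):
        print(labels, equal_error_rate(scores, labels), average_precision(scores, labels))
```

On toy data, perfectly separated scores give an EER of 0 and an average precision of 1, while fully inverted scores push the EER to 1. An EER above 0.5 therefore indicates a ranking that is worse than chance on that data.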
