Every Breath You Don't Take: Deepfake Speech Detection Using Breath

Seth Layton, Thiago De Andrade, Daniel Olszewski, Kevin Warren, Kevin Butler, Patrick Traynor

TL;DR

This work tackles the threat of deepfake speech with a high-level feature approach based on breaths. It develops a breath detector trained on in-the-wild data (podcasts and news articles) and shows that simple breath-derived features (breaths per minute, breath duration, and spacing between breaths) separate real from deepfake speech in the wild, reaching 1.0 AUPRC and 0.0 EER on test data across 33.6 hours of audio. The state-of-the-art SSL-wav2vec model, evaluated on the same samples, reaches only 0.72 AUPRC and 0.99 EER. The authors release their in-the-wild dataset and code. The results show that complex deep learning models relying on low-level spectral cues can fail on unseen data, which underscores the practical value of high-level prosodic features. The findings suggest breath-aware detectors as robust, generation-agnostic defenses; future work would integrate linguistic context to further reduce false positives as deepfakes evolve.
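
To make the three breath-derived features concrete, here is a minimal sketch of how they could be computed once breath intervals have been detected. The function name `breath_features`, the `(start, end)` interval format, and the definition of spacing as the silence-to-silence gap between the end of one breath and the start of the next are all assumptions for illustration; the paper's own feature definitions may differ.

```python
from statistics import mean


def breath_features(breaths, total_seconds):
    """Summarize detected breaths for one audio clip.

    breaths: list of (start_s, end_s) tuples, sorted by start time and
             non-overlapping (hypothetical detector output format).
    total_seconds: clip length in seconds.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")

    # Length of each individual breath.
    durations = [end - start for start, end in breaths]
    # Gap from the end of one breath to the start of the next (assumed definition).
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(breaths, breaths[1:])]

    return {
        "breaths_per_minute": 60.0 * len(breaths) / total_seconds,
        "mean_duration_s": mean(durations) if durations else 0.0,
        "mean_spacing_s": mean(gaps) if gaps else 0.0,
    }


# Three half-second breaths in a 30-second clip.
example = breath_features([(2.0, 2.5), (8.0, 8.5), (14.0, 14.5)], total_seconds=30.0)
# A clip with no detected breaths, as might happen for synthetic speech.
no_breaths = breath_features([], total_seconds=30.0)
```

A feature dictionary like this could then be passed to any simple classifier; a clip with no detected breaths yields zeros for all three values.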

Abstract

Deepfake speech represents a real and growing threat to systems and society. Many detectors have been created to aid in defense against speech deepfakes. While these detectors implement myriad methodologies, many rely on low-level fragments of the speech generation process. We hypothesize that breath, a higher-level part of speech, is a key component of natural speech and thus improper generation in deepfake speech is a performant discriminator. To evaluate this, we create a breath detector and leverage this against a custom dataset of online news article audio to discriminate between real/deepfake speech. Additionally, we make this custom dataset publicly available to facilitate comparison for future work. Applying our simple breath detector as a deepfake speech discriminator on in-the-wild samples allows for accurate classification (perfect 1.0 AUPRC and 0.0 EER on test data) across 33.6 hours of audio. We compare our model with the state-of-the-art SSL-wav2vec model and show that this complex deep learning model completely fails to classify the same in-the-wild samples (0.72 AUPRC and 0.99 EER).

Paper Structure (31 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: A visual representation of a segment of speech containing a breath, using a $window\_length$ of 20ms and a $hop\_length$ of 2.5ms. During the spoken segments before and after the breath, RMSE is at its peak values while ZCR is at its minimum values. Immediately surrounding the breath is a non-voiced segment where RMSE drops and ZCR rises; both then settle at a medium value during the breath itself. Additionally, the background mel spectrogram shows higher energy across all frequencies during spoken segments, medium energy at lower frequencies during breaths, and relatively little energy at any frequency during silence.
  • Figure 2: A visual representation of the breath detection model architecture.
  • Figure 3: A visual representation of the final stage of the detection pipeline. We use/compare three different simple classifiers in the last stage to showcase the relative interchangeability of models for final prediction.
  • Figure 4: The baseline validation testing on all podcasts vs. the leave-one-out testing for each podcast and each speaker. Each point is a specific speaker/podcast as the validation set. We show that breath measured in this capacity is generalizable and thus useful as a deepfake speech discriminator.
  • Figure 5: We show that there is a clear distinction (i.e., virtually no overlap) between human-read and synthetically-generated news articles with respect to breath statistics. The only overlap present is in outliers from each type of speech.
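
The frame-level signals named in the Figure 1 caption can be sketched as follows. This is a minimal, standard-library-only illustration, not the authors' pipeline: it assumes RMSE means root-mean-square energy and ZCR means zero-crossing rate, uses the caption's 20ms window and 2.5ms hop, and runs on a synthetic signal (a 200 Hz tone followed by low-level noise) rather than real speech. The function name `frame_features` and the 16 kHz sample rate are hypothetical choices.

```python
import math
import random


def frame_features(samples, sample_rate, window_s=0.020, hop_s=0.0025):
    """Return a list of (rms, zcr) pairs, one per sliding frame."""
    win = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if win < 2 or hop < 1:
        raise ValueError("window and hop must span at least 2 and 1 samples")

    feats = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        # Root-mean-square energy of the frame.
        rms = math.sqrt(sum(x * x for x in frame) / win)
        # Fraction of adjacent sample pairs whose signs differ.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        feats.append((rms, crossings / (win - 1)))
    return feats


sr = 16000  # assumed sample rate
rng = random.Random(0)  # fixed seed so the demo is repeatable
# 0.1 s of a loud 200 Hz tone (a stand-in for voiced speech) ...
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(1600)]
# ... followed by 0.1 s of quiet broadband noise (a stand-in for unvoiced audio).
hiss = [0.05 * rng.uniform(-1.0, 1.0) for _ in range(1600)]

feats = frame_features(tone + hiss, sr)
tone_rms, tone_zcr = feats[0]    # first frame lies entirely in the tone
hiss_rms, hiss_zcr = feats[-1]   # last frame lies entirely in the noise
```

On this synthetic input the tone frame has high energy and few zero crossings, while the noise frame has low energy and many crossings, which mirrors the voiced versus non-voiced contrast the caption describes. Real breaths would sit between these two extremes.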