Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis

Kevin Warren, Daniel Olszewski, Seth Layton, Kevin Butler, Carrie Gates, Patrick Traynor

TL;DR

This work proposes a prosody-based approach to detecting audio deepfakes by leveraging high-level linguistic features such as fundamental frequency, intonation, jitter, shimmer, and harmonic-to-noise ratio. By training an LSTM-based detector on six prosodic features and enhancing interpretability with attention, the authors achieve 93% accuracy and an EER of 24.7% on the ASVspoof2021 deepfake track, comparable to contemporary baselines. They also test robustness against an adaptive $L_{\infty}$-norm attack, showing that baseline spectral detectors are vulnerable while the prosody-based model remains largely stable, highlighting the value of linguistic features for long-term resilience. The study further demonstrates explainability by identifying the most influential prosodic cues (notably jitter, shimmer, and mean $F_0$) and provides a companion website with examples, underscoring practical implications for forensic and security applications and encouraging future integration of prosody in deepfake defense. This work thus marks a step toward combining linguistics and machine learning to build more robust, interpretable defenses against evolving audio deepfake threats.

Abstract

Audio deepfakes are increasingly indistinguishable from organic speech, often fooling both authentication systems and human listeners. While many techniques rely on low-level audio features or on optimizing black-box models, focusing on the features that humans use to recognize speech will likely be a more robust long-term approach to detection. We explore the use of prosody, or the high-level linguistic features of human speech (e.g., pitch, intonation, jitter), as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes, with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of a linguistic feature-based approach over existing models by applying an adaptive adversary using an $L_{\infty}$ norm attack against the detectors and by using attention mechanisms in our training for explainability. We show that we can explain the prosodic features that have the highest impact on the model's decision (jitter, shimmer, and mean fundamental frequency) and that other models are extremely susceptible to simple $L_{\infty}$ norm attacks (99.3% relative degradation in accuracy). While overall performance may be similar, we illustrate the robustness and explainability benefits of a prosody-based approach to audio deepfake detection.
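
To make two of the prosodic features named above concrete, the sketch below computes local jitter and local shimmer using their common textbook definitions: the mean absolute difference between consecutive glottal periods (or peak amplitudes), divided by the mean period (or amplitude). It is an illustration only, not the authors' feature extractor. The function names and the toy input values are assumptions, and a real pipeline would first have to estimate the periods and amplitudes from the waveform.

```python
# Illustrative sketch only: hypothetical helper names, toy inputs.
# Assumes glottal periods (seconds) and per-cycle peak amplitudes
# have already been estimated from the audio by some other tool.

def _relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def local_jitter(periods):
    """Cycle-to-cycle variation in period length (a frequency perturbation)."""
    return _relative_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation in peak amplitude."""
    return _relative_perturbation(amplitudes)


# Toy example: periods alternate between 10 ms and 11 ms.
periods = [0.010, 0.011, 0.010, 0.011]
amplitudes = [1.0, 0.9, 1.0, 0.9]

jitter = local_jitter(periods)        # roughly 0.095, i.e. about 9.5%
shimmer = local_shimmer(amplitudes)   # roughly 0.105, i.e. about 10.5%
print(f"jitter={jitter:.3f} shimmer={shimmer:.3f}")
```

Both quantities are dimensionless ratios, so they can be compared across speakers with different average pitch or loudness.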

Paper Structure

This paper contains 45 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: The standard generation for audio deepfakes is an encoder/synthesizer/vocoder pipeline. The encoder generates a voice embedding, the synthesizer creates a spectrogram for a target phrase, and the vocoder converts the spectrogram into a synthetic waveform.
  • Figure 2: A voice conversion pipeline for a two-speaker setup. A model is trained on the features of both the target and source speakers. Using the trained model, the technique takes in a source waveform, transforms it, and then synthesizes the target waveform.
  • Figure 3: The pipeline for processing speech sample features and the final LSTM model architecture used as our detector. Pipeline (a) shows the audio preprocessing steps for feature extraction and batch generation used to train our LSTM architecture. The model architecture in (b) shows the size and activation units for each hidden layer. These architectural details are provided so that the model can be reproduced to duplicate or verify results. Combined, this pipeline and model form our classification process.
  • Figure 4: Examples of the spectrogram and fundamental frequency sequences for an organic and synthetic audio sample. The top graph is an organic speaker. The bottom graph is a deepfake trained on the same organic speaker and generated to say the same sentence. Highlighted generation issues illustrate (1) inflection changes, (2) pause discrepancies, and (3,4) combinations of inflection changes, pause discrepancies, and pitch variance.
  • Figure 5: AUROC/EER for each model architecture tested against the ASVspoof2021 validation dataset.
  • ...and 4 more figures
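
Figure 5 and the abstract report an equal error rate (EER), the operating point at which the false-positive and false-negative rates are equal. The following is a minimal sketch of how EER can be estimated from detector scores by sweeping a threshold. It is not the paper's evaluation code; the function name, the score convention (higher means more likely synthetic), and the toy data are assumptions.

```python
# Illustrative sketch only: hypothetical function name and toy data.
# Assumed convention: label 1 = deepfake, label 0 = organic,
# and a higher score means "more likely deepfake".

def equal_error_rate(scores, labels):
    """Return (eer, threshold) at the threshold where FPR and FNR are closest."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    best = None  # (gap, eer_estimate, threshold)
    for t in sorted(set(scores)):
        # A sample is flagged as a deepfake when its score is >= t.
        fnr = sum(s < t for s in positives) / len(positives)   # missed deepfakes
        fpr = sum(s >= t for s in negatives) / len(negatives)  # flagged organic
        gap = abs(fnr - fpr)
        if best is None or gap < best[0]:
            best = (gap, (fnr + fpr) / 2, t)
    return best[1], best[2]


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
print(eer, threshold)  # eer == 0.25 at threshold == 0.6
```

Accuracy is measured at a single fixed threshold and depends on the class balance of the test set, whereas EER does not, so a detector can report a high accuracy alongside a comparatively high EER.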