Easy, Interpretable, Effective: openSMILE for voice deepfake detection

Octavian Pascu, Dan Oneata, Horia Cucu, Nicolas M. Müller

TL;DR

The paper tackles anti-spoofing for ASVspoof5 by exploiting interpretable, scalar openSMILE features (from eGeMAPSv2) to detect voice deepfakes generated by various TTS systems. It shows that single features can yield strong in-domain detection, with low equal error rates for some attacks, but cross-domain generalization is limited and highly dependent on architectural similarity among TTS models. A direct neural-front-end comparison (Wav2Vec2) reveals that aggregated neural features can outperform openSMILE overall, though openSMILE maintains advantages in interpretability and attack-specific fingerprints. The work highlights the existence of TTS fingerprints that aid explainable anti-spoofing yet pose challenges for generalization across unseen architectures, suggesting a path toward hybrid, interpretable-plus-robust approaches for real-world deployment.

Abstract

In this paper, we demonstrate that attacks in the latest ASVspoof5 dataset -- a de facto standard in the field of voice authenticity and deepfake detection -- can be identified with surprising accuracy using a small subset of very simplistic features. These are derived from the openSMILE library, and are scalar-valued, easy to compute, and human interpretable. For example, attack A10's unvoiced segments have a mean length of 0.09 ± 0.02, while bona fide instances have a mean length of 0.18 ± 0.07. Using this feature alone, a threshold classifier achieves an Equal Error Rate (EER) of 10.3% for attack A10. Similarly, across all attacks, we achieve up to 0.8% EER, with an overall EER of 15.7 ± 6.0%. We explore the generalization capabilities of these features and find that some of them transfer effectively between attacks, primarily when the attacks originate from similar Text-to-Speech (TTS) architectures. This finding may indicate that voice anti-spoofing is, in part, a problem of identifying and remembering signatures or fingerprints of individual TTS systems. This helps to better understand anti-spoofing models and the challenges they face in real-world applications.
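
The single-feature threshold classifier and its EER can be illustrated with a minimal sketch. This is not the authors' code: the function name `threshold_eer` and the threshold sweep are assumptions made for illustration, and the data is synthetic, drawn from Gaussians with the means and standard deviations quoted above. Real feature distributions need not be Gaussian, so the value printed here is not expected to reproduce the reported 10.3%.

```python
import numpy as np


def threshold_eer(bona_fide, spoof):
    """Return (eer, threshold) for the rule: predict 'spoof' if feature < threshold.

    Assumes lower feature values indicate spoofed audio, as in the A10 example.
    """
    bona_fide = np.asarray(bona_fide, dtype=float)
    spoof = np.asarray(spoof, dtype=float)

    # Candidate thresholds: every observed value, plus one above the maximum.
    values = np.unique(np.concatenate([bona_fide, spoof]))
    candidates = np.append(values, values[-1] + 1.0)

    # False rejection rate: bona fide samples flagged as spoof.
    frr = np.array([(bona_fide < t).mean() for t in candidates])
    # False acceptance rate: spoof samples accepted as bona fide.
    far = np.array([(spoof >= t).mean() for t in candidates])

    # The EER is read off where the two error rates are closest.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2), float(candidates[i])


# Synthetic stand-in data (assumption): Gaussians with the quoted statistics.
rng = np.random.default_rng(0)
bona_fide = rng.normal(0.18, 0.07, size=5000)
spoof = rng.normal(0.09, 0.02, size=5000)

eer, threshold = threshold_eer(bona_fide, spoof)
print(f"EER = {eer:.3f} at threshold {threshold:.3f}")
```

For a feature where higher values indicate spoofed audio, the comparison direction would need to be flipped; a practical implementation would likely evaluate both directions and keep the better one.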

Paper Structure (10 sections, 3 figures, 4 tables)

This paper contains 10 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Distribution of the openSMILE 'eGeMAPSv2' feature F85, 'MeanUnvoicedSegmentLength', computed for attack $A10$ and bona fide data from ASVspoof5. A simple threshold classifier obtains an EER of $10.3\%$ by predicting 'spoof' if $F85 < 0.12$, else 'bona fide' (dotted line).
  • Figure 2: Visualization of the ASVspoof5 dataset's 'train' and 'dev' partition. BT and BD correspond to the respective bona fide data, while $A01$ through $A16$ correspond to the individual attacks. Grey boxes on the left indicate the naming convention we use in subsequent experiments.
  • Figure 3: Distribution plot comparing the performance of scalar-valued features extracted using openSMILE (blue) and Wav2Vec2 (orange) across attacks $A01$ to $A16$. The $x$-axis represents the EER obtained by each feature, and the $y$-axis denotes frequency. The plot reveals that openSMILE features sometimes exhibit higher predictive accuracy, as evidenced by a greater concentration of distributional mass near $EER=0$.
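
The kind of comparison described for Figure 3 can be sketched as follows: given one EER per scalar feature for each front-end, bin both sets on shared edges and measure how much mass lies near zero. The helper names, the cutoff of 0.1, the bin count, and the EER values below are all hypothetical placeholders, not results from the paper.

```python
import numpy as np


def mass_near_zero(eers, cutoff=0.1):
    """Fraction of features whose EER is at or below `cutoff` (hypothetical cutoff)."""
    eers = np.asarray(eers, dtype=float)
    return float((eers <= cutoff).mean())


def shared_histogram(eers_a, eers_b, n_bins=20):
    """Histogram two sets of per-feature EERs on the same bin edges over [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts_a, _ = np.histogram(eers_a, bins=edges)
    counts_b, _ = np.histogram(eers_b, bins=edges)
    return edges, counts_a, counts_b


# Placeholder per-feature EERs (made up for illustration only).
front_end_a = [0.02, 0.08, 0.31, 0.44, 0.47]
front_end_b = [0.12, 0.18, 0.22, 0.27, 0.36]

edges, counts_a, counts_b = shared_histogram(front_end_a, front_end_b)
print("share of features with EER <= 0.1:",
      mass_near_zero(front_end_a), mass_near_zero(front_end_b))
```

Using shared bin edges keeps the two histograms directly comparable, which is what a statement about relative mass near $EER=0$ relies on.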