Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes

Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yushu Zhang, Yifang Guo

TL;DR

This work addresses the challenge of detecting sophisticated speech deepfakes by focusing on inconsistencies in phoneme-level features rather than isolated phonemes. It introduces adaptive phoneme pooling to create sample-specific phoneme-level representations from frame-level features, and employs a graph attention network to capture temporal dependencies among phonemes, augmented by random phoneme substitution to boost training diversity. A frozen phoneme recognizer, a copied Transformer, and a GAT form the core detector, with a CLIP-based loss guiding semantic alignment alongside a BCE classification objective. Across multiple benchmarks and languages, the method achieves state-of-the-art results and demonstrates strong generalization and robustness to noise and compression, offering a practical pathway for real-world deepfake detection.

Abstract

Recent advancements in text-to-speech and speech conversion technologies have enabled the creation of highly convincing synthetic speech. While these innovations offer numerous practical benefits, they also pose significant security challenges when maliciously misused. Therefore, there is an urgent need to detect these synthetic speech signals. Phoneme features provide a powerful speech representation for deepfake detection. However, previous phoneme-based detection approaches typically focused on specific phonemes, overlooking temporal inconsistencies across the entire phoneme sequence. In this paper, we develop a new mechanism for detecting speech deepfakes by identifying inconsistencies in phoneme-level speech features. We design an adaptive phoneme pooling technique that extracts sample-specific phoneme-level features from frame-level speech data. By applying this technique to features extracted by pre-trained audio models on previously unseen deepfake datasets, we demonstrate that deepfake samples often exhibit phoneme-level inconsistencies when compared to genuine speech. To further enhance detection accuracy, we propose a deepfake detector that uses a graph attention network to model the temporal dependencies of phoneme-level features. Additionally, we introduce a random phoneme substitution augmentation technique to increase feature diversity during training. Extensive experiments on four benchmark datasets demonstrate the superior performance of our method over existing state-of-the-art detection methods.

Paper Structure

This paper contains 39 sections, 6 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: t-SNE cluster results. We first employ a pre-trained audio model, Wav2Vec2, to extract the frame-level speech features (last hidden states) from the IT and PL subsets of MLAAD, a multilingual deepfake speech dataset. The phoneme-level features are generated from frame-level features using adaptive phoneme pooling (see Fig. 2).
  • Figure 2: Adaptive phoneme pooling process. Consecutive frames that share the same phoneme label in the frame-level feature are combined (averaged) into a single vector.
  • Figure 3: Training of the multilingual phoneme recognition model and t-SNE cluster results of phoneme-level speech features. After training the multilingual phoneme recognition model, we employ it to generate phoneme labels and then generate phoneme-level features from the multilingual frame-level features extracted by Wav2Vec2 and WavLM. t-SNE visualization results demonstrate that phoneme-level features effectively discriminate between real and fake samples.
  • Figure 4: Overview of our deepfake detection model. Given the input feature, our model first uses a pre-trained phoneme-recognition model to predict frame phonemes, then uses a copied Transformer to learn frame-level speech features, next employs a GAT to capture temporal dependencies of phoneme-level speech features, and finally performs classification.
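The adaptive phoneme pooling step described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes only what the caption states: consecutive frames that share a phoneme label are averaged into one vector. The function name `adaptive_phoneme_pooling`, the plain-list data layout, and the toy values are hypothetical choices for illustration and are not taken from the authors' implementation.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def adaptive_phoneme_pooling(
    frames: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> Tuple[List[List[float]], List[int]]:
    """Average each run of consecutive frames that share a phoneme label.

    frames: one feature vector per frame (hypothetical layout).
    labels: one predicted phoneme ID per frame.
    Returns the pooled vectors and the phoneme ID of each pooled segment.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    pooled: List[List[float]] = []
    pooled_labels: List[int] = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segment = frames[start:start + length]
        # Mean over the time axis, computed per feature dimension.
        pooled.append([sum(column) / length for column in zip(*segment)])
        pooled_labels.append(label)
        start += length
    return pooled, pooled_labels


# Toy input: 6 frames of 2-dimensional features with one phoneme ID per frame.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 5.0], [7.0, 1.0], [9.0, 3.0], [2.0, 2.0]]
labels = [4, 4, 7, 7, 7, 4]
pooled, pooled_labels = adaptive_phoneme_pooling(frames, labels)
print(pooled)         # [[2.0, 3.0], [7.0, 3.0], [2.0, 2.0]]
print(pooled_labels)  # [4, 7, 4]
```

Note that the two separate runs of phoneme 4 stay separate, so the pooled sequence length depends on the sample's own phoneme boundaries; this is one reading of what makes the pooling "sample-specific".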