The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection

Yupei Li, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang, Björn W. Schuller

TL;DR

The paper tackles the fragmentation of feature representations in speech deepfake detection by introducing EmoBridge, a pre-training framework that aligns diverse feature domains into an emotion-centered representation. By training an emotion recognition task and freezing downstream components, EmoBridge preserves original features while injecting affective cues, evaluated across multiple datasets with SVM-based detectors. Results show consistent improvements in accuracy and reductions in EER, particularly for DL-derived features, and provide interpretable evidence that emotion cues help distinguish real from synthetic speech. This approach offers a scalable, human-interpretable path toward unified feature representations in deepfake detection, with potential applicability to future, more expressive synthetic speech challenges.

Abstract

Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select different feature sets, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a unique human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features have implicit correlations with emotion. For instance, speech features recognized by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial improvements in accuracy, up to approximately 6% and 2% increases, respectively, and in equal error rate (EER), showing reductions of up to about 4% and 1%, respectively, while achieving comparable results on ASVspoof2019. This approach provides a unified training strategy for all features and an interpretable feature direction for deepfake detection while improving model performance through emotion-informed learning.
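
For readers unfamiliar with the EER metric quoted above, the following is a minimal sketch of one common way to compute it: sweep a decision threshold over the detector scores and report the point where the false acceptance rate (spoofed speech accepted as bona fide) and the false rejection rate (bona fide speech rejected) are closest. The function name `equal_error_rate`, the score convention (higher means more likely bona fide), and the toy scores are assumptions made for illustration; they are not taken from the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by a threshold sweep.

    scores: detector outputs, higher = more likely bona fide (assumed convention).
    labels: 1 for bona fide, 0 for spoofed speech.
    """
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        # Accept an utterance as bona fide when its score is >= t.
        far = sum(s >= t for s in spoof) / len(spoof)  # false acceptance rate
        frr = sum(s < t for s in bona) / len(bona)     # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy, hand-made scores (not from any dataset in the paper).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))  # 0.25
```

In the toy data one bona fide score (0.4) falls below one spoof score (0.6), so at a threshold of 0.6 both error rates equal 1/4. A lower EER therefore means the two score distributions overlap less.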

Paper Structure

This paper contains 7 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Emotion as a bridge framework. The pre-trained model for a specific task provides an original feature representation, where essential information is stored in the encoder. We further train the encoder, together with a fully connected layer, to perform an emotion recognition task, thereby fusing emotion-related features within the encoder (EmoBridge step). The outputs from each encoder layer, which now incorporate both original and emotion-based features, are then used as inputs to a classifier—such as a support vector machine (SVM)—for the final deepfake detection task.
  • Figure 2: Comparison of mean attention values of Whisper layers before (model_ori) and after (model_new) the EmoBridge step.
  • Figure 3: t-SNE visualizations for SV (left) and ASR (right) of selected sample representations obtained before (model_ori) and after (model_new) EmoBridge. The $x\_y\_z$ label denotes the speaker, emotion, and content respectively; identical subscripts indicate the same speaker, emotion, or content.
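
The pipeline described in the Figure 1 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyEncoder` is a hypothetical stand-in for a pre-trained task encoder, inputs are random utterance-level vectors rather than real speech, and `emobridge_step` and `layer_features` are names invented here. The sketch only shows the three stages the caption names: further training the encoder with a fully connected layer on emotion recognition, taking the outputs of each encoder layer, and fitting an SVM on them for deepfake detection.

```python
import numpy as np
import torch
from torch import nn
from sklearn.svm import SVC

torch.manual_seed(0)


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained task encoder (e.g. ASR or SV)."""

    def __init__(self, dim_in=20, dim_hidden=16, num_layers=3):
        super().__init__()
        dims = [dim_in] + [dim_hidden] * num_layers
        self.layers = nn.ModuleList(
            [nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:])]
        )

    def forward(self, x):
        outputs = []  # keep the output of every layer, not only the last
        for layer in self.layers:
            x = torch.relu(layer(x))
            outputs.append(x)
        return outputs


def emobridge_step(encoder, x, emotion_labels, num_emotions, epochs=20):
    """Train encoder + fully connected head on emotion labels, then freeze."""
    head = nn.Linear(encoder.layers[-1].out_features, num_emotions)
    params = list(encoder.parameters()) + list(head.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(head(encoder(x)[-1]), emotion_labels).backward()
        optimizer.step()
    for p in encoder.parameters():
        p.requires_grad_(False)  # encoder is fixed for the detection stage
    return encoder


def layer_features(encoder, x):
    """Per-layer encoder outputs as NumPy arrays for a classical classifier."""
    with torch.no_grad():
        return [h.numpy() for h in encoder(x)]


# Random placeholder data (assumption): 4 emotion classes, binary real/fake labels.
x_emo, y_emo = torch.randn(64, 20), torch.randint(0, 4, (64,))
x_df, y_df = torch.randn(80, 20), np.array([0, 1] * 40)

encoder = emobridge_step(ToyEncoder(), x_emo, y_emo, num_emotions=4)
feats = layer_features(encoder, x_df)

# One SVM per encoder layer; 60 training and 20 held-out samples.
layer_acc = [
    SVC(kernel="rbf").fit(f[:60], y_df[:60]).score(f[60:], y_df[60:])
    for f in feats
]
print(layer_acc)
```

Because the data here is random noise, the printed accuracies carry no meaning; the point is the data flow. With a real encoder, the per-layer features would come from the model's hidden states, pooled over time to one vector per utterance.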