SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj

TL;DR

SLIM tackles generalization and explainability gaps in audio deepfake detection by explicitly modelling the style–linguistics mismatch. It uses a two-stage learning framework: Stage 1 performs one-class self-supervised contrastive training on real speech to learn style and linguistics dependencies and generates dependency features, while Stage 2 fuses these features with reduced SSL embeddings to train a binary real/fake classifier. The approach yields strong out-of-domain performance on In-the-wild and MLAAD-EN while remaining competitive in-domain, and provides interpretable indicators of mismatch that help explain model decisions. Importantly, SLIM achieves this without extra labeled data or end-to-end fine-tuning, offering practical benefits for deployment and trust in ADD systems.
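
The Stage 2 fusion described above can be illustrated with a toy sketch. It assumes the fusion is a plain concatenation and the classifier is a single logistic unit trained with binary cross-entropy; both are simplifications chosen for illustration, and the function names and toy feature values are hypothetical rather than taken from the paper.

```python
import math

def fuse(dependency_feats, reduced_ssl_feats):
    """Concatenate Stage 1 dependency features with reduced SSL embeddings (assumed fusion)."""
    return list(dependency_feats) + list(reduced_ssl_feats)

def predict_fake_probability(fused, weights, bias):
    """Logistic classifier head: returns P(fake) for one fused feature vector."""
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def sgd_step(fused, label, weights, bias, lr=0.1):
    """One gradient step on binary cross-entropy; label is 1 for fake, 0 for real."""
    error = predict_fake_probability(fused, weights, bias) - label
    new_weights = [w - lr * error * x for w, x in zip(weights, fused)]
    return new_weights, bias - lr * error

# Made-up toy data: two "real" (label 0) and two "fake" (label 1) samples.
samples = [
    (fuse([0.1, 0.0], [0.2]), 0),
    (fuse([0.0, 0.1], [0.1]), 0),
    (fuse([0.9, 1.0], [0.8]), 1),
    (fuse([1.0, 0.8], [0.9]), 1),
]

weights, bias = [0.0, 0.0, 0.0], 0.0
for _ in range(500):
    for features, label in samples:
        weights, bias = sgd_step(features, label, weights, bias)
```

Only the small head is updated here; the feature extractors never appear in the training loop, which mirrors the frozen-encoder setting the summary describes.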

Abstract

Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized from generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate it from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used to complement standard pretrained acoustic features (e.g., Wav2vec) to learn a classifier on the real and fake classes. When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, hence facilitating an explanation of the model decision.
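
One way to picture how such a (mis)match can be quantified is the cosine distance that Figure 2 reports between the style and linguistics dependency features. The sketch below assumes each sample has already been reduced to one style vector and one linguistics vector of equal length; that pooling step and the function name `cosine_distance` are illustrative assumptions, not details from the paper.

```python
import math

def cosine_distance(style_vec, linguistics_vec):
    """Return 1 - cosine similarity; 0 means aligned, larger means more mismatch."""
    if len(style_vec) != len(linguistics_vec):
        raise ValueError("vectors must have the same length")
    dot = sum(s * l for s, l in zip(style_vec, linguistics_vec))
    style_norm = math.sqrt(sum(s * s for s in style_vec))
    ling_norm = math.sqrt(sum(l * l for l in linguistics_vec))
    if style_norm == 0.0 or ling_norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (style_norm * ling_norm)
```

Under this reading, a sample whose style and linguistics features point in similar directions scores near 0, while a larger value would flag a stronger mismatch.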

Paper Structure (33 sections, 1 equation, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: SLIM: A two-stage training framework for ADD. Stage 1 extracts style and linguistics representations from frozen SSL encoders, compresses them, and aims to minimize the distance between the compressed representations ($\mathcal{L}_{cross}$), as well as the intra-subspace redundancy ($\mathcal{L}_{style}$ and $\mathcal{L}_{linguistics}$). The Stage 1 features and the original subspace representations (pretrained SSL embeddings) are combined in Stage 2 to learn a classifier via supervised training.
  • Figure 2: Cosine distance (log scale) calculated between the style and linguistics dependency features for ASVspoof2021 DF eval, In-the-wild, and MLAAD-EN. Whiskers from top to bottom represent the 75th percentile, median, and 25th percentile of the distribution.
  • Figure 3: Projected embeddings using t-SNE for style-linguistic representations: (a) subspace embeddings - real class, (b) subspace embeddings - fake class, (c) dependency features - real class, (d) dependency features - fake class. Data distributions are visualized on the upper and right side of the embedding plots. Red: ASVspoof2021; Green: In-the-wild; Blue: MLAAD-EN.
  • Figure 4: Mel-spectrograms of select samples from In-the-wild. SLIM classifies all four correctly, and when reporting fakes, provides guidance on abnormalities in style and/or linguistics. Also, the dependency and subspace features in SLIM are complementary to each other. Left: samples missed by dependency features but correctly identified by the style and linguistic features; right: vice versa.
  • Figure 5: Spearman correlation coefficients calculated across all layers from two pretrained Wav2vec-XLSR backbones. Blue highlights layers 0-10 from Wav2vec-SER to represent style information. Red highlights layers 14-21 from Wav2vec-ASR to represent linguistics information. The correlation values between the selected layers can be read from the overlapping region.
  • ...and 2 more figures
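
The Stage 1 objective named in the Figure 1 caption combines a cross-subspace distance ($\mathcal{L}_{cross}$) with two intra-subspace redundancy terms ($\mathcal{L}_{style}$ and $\mathcal{L}_{linguistics}$). The captions do not give the exact form of these terms, so the sketch below is only one plausible reading: it assumes a mean squared distance for the cross term, a penalty on off-diagonal Gram-matrix entries for redundancy, and a hypothetical weight `lam`. Inputs are (frames x dims) nested lists standing in for the compressed representations.

```python
def cross_loss(style, linguistics):
    """Mean squared distance between two (frames x dims) representations (assumed form)."""
    count = len(style) * len(style[0])
    total = sum(
        (s - l) ** 2
        for style_row, ling_row in zip(style, linguistics)
        for s, l in zip(style_row, ling_row)
    )
    return total / count

def redundancy_loss(feats):
    """Sum of squared off-diagonal entries of the dims x dims Gram matrix (assumed form)."""
    dims = len(feats[0])
    total = 0.0
    for i in range(dims):
        for j in range(dims):
            if i != j:
                gram_ij = sum(row[i] * row[j] for row in feats)
                total += gram_ij ** 2
    return total

def stage1_loss(style, linguistics, lam=1.0):
    """Cross term plus weighted redundancy terms; lam is a hypothetical weight."""
    return cross_loss(style, linguistics) + lam * (
        redundancy_loss(style) + redundancy_loss(linguistics)
    )
```

With these definitions, two identical representations whose dimensions do not co-activate give a loss of zero; the loss grows when the two subspaces drift apart or when dimensions within one subspace become correlated.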