SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj
TL;DR
SLIM tackles generalization and explainability gaps in audio deepfake detection (ADD) by explicitly modelling the style-linguistics mismatch in fake speech. It uses a two-stage learning framework. Stage 1 performs one-class self-supervised contrastive training on real speech only, learning the dependency between style and linguistics and producing dependency features. Stage 2 fuses these features with dimension-reduced self-supervised learning (SSL) embeddings to train a binary real/fake classifier. The approach yields strong out-of-domain performance on In-the-wild and MLAAD-EN while remaining competitive in-domain, and provides interpretable indicators of mismatch that help explain model decisions. Importantly, SLIM achieves this without extra labeled data or end-to-end fine-tuning, offering practical benefits for deployment and trust in ADD systems.
Abstract
Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized from generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate it from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used alongside standard pretrained acoustic features (e.g., Wav2vec) to learn a classifier on the real and fake classes. When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, hence facilitating an explanation of the model decision.
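To make the idea of quantifying the (mis)match concrete, here is a minimal Python sketch. It is illustrative only: the abstract does not specify the distance measure or the fusion operator, so cosine distance and plain concatenation are assumptions, and the names `mismatch_score` and `fuse_features` are hypothetical. The inputs stand for already-computed per-utterance vectors: the style and linguistics dependency features from the first stage and a reduced pretrained acoustic embedding.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def mismatch_score(style_dep: Sequence[float], ling_dep: Sequence[float]) -> float:
    """Hypothetical mismatch indicator (assumed to be cosine distance).

    Near 0 means the style and linguistics features agree;
    larger values mean a stronger mismatch.
    """
    return cosine_distance(style_dep, ling_dep)


def fuse_features(
    style_dep: Sequence[float],
    ling_dep: Sequence[float],
    acoustic_reduced: Sequence[float],
) -> List[float]:
    """Assumed fusion: concatenate the dependency features with the
    reduced acoustic embedding to form the classifier input."""
    return list(style_dep) + list(ling_dep) + list(acoustic_reduced)


# Toy vectors standing in for real encoder outputs.
matched = mismatch_score([1.0, 0.0], [2.0, 0.0])      # same direction
mismatched = mismatch_score([1.0, 0.0], [0.0, 3.0])   # orthogonal
classifier_input = fuse_features([1.0, 0.0], [0.0, 3.0], [0.5, 0.5, 0.5])
```

In this sketch a real/fake classifier would be trained on `classifier_input`, while the scalar score could be reported next to the prediction as a rough explanation of why a sample was flagged.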
