Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Ajinkya Kulkarni; Sandipana Dowerah; Atharva Kulkarni; Tanel Alumäe; Mathew Magimai Doss

TL;DR

This paper shows that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M models to match larger and commercial systems. It also introduces a test-time augmentation protocol with perturbation-based aleatoric uncertainty that exposes calibration differences invisible to standard metrics.

Abstract

Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M models to match larger and commercial systems. Beyond equal error rate (EER), we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
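The test-time augmentation idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the perturbations or the uncertainty formula, so everything here is an assumption made for illustration: the gain and additive-noise perturbations, the use of mean binary entropy over perturbed copies as the aleatoric term, and the hypothetical names `perturb`, `tta_uncertainty`, and `toy_detector` (a stand-in for a real detector that returns a spoof probability).

```python
import math
import random


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli prediction with spoof probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def perturb(wave, rng, noise_std=0.005, gain_range=(0.8, 1.2)):
    """Assumed perturbation: random gain plus additive Gaussian noise."""
    gain = rng.uniform(*gain_range)
    return [gain * s + rng.gauss(0.0, noise_std) for s in wave]


def tta_uncertainty(wave, score_fn, n_views=8, seed=0):
    """Score n_views perturbed copies of one utterance and summarize them.

    score_fn maps a waveform (list of floats) to a spoof probability in [0, 1].
    """
    rng = random.Random(seed)
    probs = [score_fn(perturb(wave, rng)) for _ in range(n_views)]
    mean_p = sum(probs) / n_views
    var_p = sum((p - mean_p) ** 2 for p in probs) / n_views
    # Assumed aleatoric term: average per-view predictive entropy.
    aleatoric = sum(binary_entropy(p) for p in probs) / n_views
    return {"mean_prob": mean_p, "variance": var_p, "aleatoric_entropy": aleatoric}


def toy_detector(wave):
    """Hypothetical stand-in detector: a logistic function of signal energy."""
    energy = sum(s * s for s in wave) / len(wave)
    return 1.0 / (1.0 + math.exp(-(20.0 * energy - 1.0)))


if __name__ == "__main__":
    wave = [math.sin(0.01 * n) for n in range(1600)]
    print(tta_uncertainty(wave, toy_detector))
```

Under these assumptions, a detector whose score barely moves across perturbed copies while staying near 0 or 1 reports low entropy and low variance. Whether that confidence is warranted is the calibration question the abstract raises.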

Paper Structure (16 sections, 5 equations, 2 figures, 4 tables)

Introduction
Method
Compact SSL Backbone Families
RAPTOR: Unified Layer-Fusion Detector
TTA-Based Uncertainty Estimation
Experimental Setup
Datasets and Training Protocols
Implementation Details
Evaluation Protocols
Results and Analysis
SSL Pre-Training Trajectory and Cross-Domain Robustness
Compact 100M Systems vs. Large-Scale and Commercial Models
TTA-Based Uncertainty and Confidence Calibration
Discussion
Conclusion
...and 1 more section

Figures (2)

Figure 1: RAPTOR framework. SSL layer representations are progressively fused by pairwise and hierarchical softmax gates, followed by attention pooling and a binary classifier.
Figure 2: Pairwise gate maps $\alpha_{p,1}(t)$ for a spoofed utterance from ITW produced by mHuBERT-Iter2. The $x$-axis denotes time frames (50 ms resolution), the $y$-axis the SSL layer-pair index ($p=1\ldots6$). Spoof utterances activate lower-to-middle layer pairs (indices 2--4) more strongly, suggesting synthesis artifacts concentrate at earlier stages of the SSL hierarchy.
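The fusion pipeline named in the two captions (pairwise softmax gates, a hierarchical gate, attention pooling, a binary classifier) can be sketched as follows. This is a toy illustration, not the RAPTOR implementation: pairing adjacent layers so that 12 layers give the 6 pairs of Figure 2, the linear gate scores, and the function names `pairwise_gate_fuse`, `hierarchical_gate`, `attention_pool`, and `spoof_probability` are all assumptions. The weights are random placeholders and the Transformer component is omitted.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mix(weights, vectors):
    """Weighted sum of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def pairwise_gate_fuse(layers, gate_vecs):
    """Fuse adjacent layers with a per-frame two-way softmax gate.

    layers: [L][T][D] hidden states; gate_vecs: [L][D] hypothetical scorers.
    Returns fused [L/2][T][D] and gate maps alpha[p][t], the weight on the
    first layer of pair p at frame t.
    """
    fused, alpha = [], []
    for p in range(0, len(layers), 2):
        a, b = layers[p], layers[p + 1]
        fused_p, alpha_p = [], []
        for t in range(len(a)):
            w = softmax([dot(gate_vecs[p], a[t]), dot(gate_vecs[p + 1], b[t])])
            fused_p.append(mix(w, [a[t], b[t]]))
            alpha_p.append(w[0])
        fused.append(fused_p)
        alpha.append(alpha_p)
    return fused, alpha


def hierarchical_gate(fused, pair_vecs):
    """Second-stage softmax gate across pair outputs, per frame -> [T][D]."""
    out = []
    for t in range(len(fused[0])):
        views = [fused_p[t] for fused_p in fused]
        w = softmax([dot(v, x) for v, x in zip(pair_vecs, views)])
        out.append(mix(w, views))
    return out


def attention_pool(frames, query):
    """Collapse [T][D] frames into one [D] vector with softmax attention."""
    return mix(softmax([dot(query, f) for f in frames]), frames)


def spoof_probability(pooled, weight, bias=0.0):
    """Logistic binary classifier on the pooled utterance embedding."""
    return 1.0 / (1.0 + math.exp(-(dot(weight, pooled) + bias)))


if __name__ == "__main__":
    rng = random.Random(0)
    L, T, D = 12, 5, 4
    vec = lambda: [rng.gauss(0.0, 1.0) for _ in range(D)]
    layers = [[vec() for _ in range(T)] for _ in range(L)]
    fused, alpha = pairwise_gate_fuse(layers, [vec() for _ in range(L)])
    frames = hierarchical_gate(fused, [vec() for _ in range(L // 2)])
    print(spoof_probability(attention_pool(frames, vec()), vec()))
```

In this sketch, `alpha` has one row per layer pair and one column per frame, which is the layout a gate map like the one in Figure 2 would visualize.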
