Unsupervised Domain Adaptation for Audio Deepfake Detection with Modular Statistical Transformations

Urawee Thani, Gagandeep Singh, Priyanka Singh

TL;DR

This paper presents a modular pipeline for unsupervised domain adaptation that combines pre-trained Wav2Vec 2.0 embeddings with statistical transformations to improve cross-domain generalization without requiring labeled target data.

Abstract

Audio deepfake detection systems trained on one dataset often fail when deployed on data from different sources due to distributional shifts in recording conditions, synthesis methods, and acoustic environments. We present a modular pipeline for unsupervised domain adaptation that combines pre-trained Wav2Vec 2.0 embeddings with statistical transformations to improve cross-domain generalization without requiring labeled target data. Our approach applies power transformation for feature normalization, ANOVA-based feature selection, joint PCA for domain-agnostic dimensionality reduction, and CORAL alignment to match source and target covariance structures before classification via logistic regression. We evaluate on two cross-domain transfer scenarios: ASVspoof 2019 LA to Fake-or-Real (FoR) and FoR to ASVspoof, achieving 62.7-63.6% accuracy with balanced performance across real and fake classes. Systematic ablation experiments reveal that feature selection (+3.5%) and CORAL alignment (+3.2%) provide the largest individual contributions, with the complete pipeline improving accuracy by 10.7% over baseline. While performance is modest compared to within-domain detection (94-96%), our pipeline offers transparency and modularity, making it suitable for deployment scenarios requiring interpretable decisions.
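
The stages named in the abstract can be sketched roughly as follows. This is a minimal illustration built on scikit-learn and NumPy, not the authors' implementation. It assumes Wav2Vec 2.0 embeddings have already been extracted into arrays `Xs` (source, with labels `ys`) and `Xt` (unlabeled target). The function names, the default `k_features=512`, and the choice to fit the power transform on source data only are hypothetical; `n_components=256` and `lam=1e-6` follow the values quoted in the figure captions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PowerTransformer


def _sym_matrix_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 1e-12, None)  # guard against tiny negative eigenvalues
    return (vecs * vals ** power) @ vecs.T


def coral(Xs, Xt, lam=1e-6):
    """Re-color source features so their covariance matches the target's."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + lam * np.eye(d)  # regularized source covariance
    Ct = np.cov(Xt, rowvar=False) + lam * np.eye(d)  # regularized target covariance
    return Xs @ _sym_matrix_power(Cs, -0.5) @ _sym_matrix_power(Ct, 0.5)


def adapt_and_classify(Xs, ys, Xt, k_features=512, n_components=256, lam=1e-6):
    """Hypothetical end-to-end sketch: returns target predictions and P(class 1)."""
    # 1) Yeo-Johnson power transform (assumption: fitted on source only).
    pt = PowerTransformer(method="yeo-johnson", standardize=True).fit(Xs)
    Zs, Zt = pt.transform(Xs), pt.transform(Xt)

    # 2) ANOVA F-test feature selection; needs labels, so source only.
    sel = SelectKBest(score_func=f_classif, k=k_features).fit(Zs, ys)
    Zs, Zt = sel.transform(Zs), sel.transform(Zt)

    # 3) Joint PCA: one projection fitted on pooled source + target features.
    pca = PCA(n_components=n_components).fit(np.vstack([Zs, Zt]))
    Zs, Zt = pca.transform(Zs), pca.transform(Zt)

    # 4) CORAL: align source covariance to the target's.
    Zs_aligned = coral(Zs, Zt, lam=lam)

    # 5) Logistic regression trained on aligned source, applied to target.
    clf = LogisticRegression(max_iter=1000).fit(Zs_aligned, ys)
    return clf.predict(Zt), clf.predict_proba(Zt)[:, 1]
```

Target labels are never used in any stage: only the joint PCA and CORAL steps see target data, and both are label-free. `k_features` must not exceed the embedding dimension, and `n_components` must not exceed `k_features`.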

Paper Structure (28 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Cross-Domain Audio Deepfake Detection Pipeline. Audio from source (ASVspoof) and target (FoR) datasets undergoes feature extraction (Wav2Vec 2.0), power transformation (Yeo-Johnson), feature selection (ANOVA), dimensionality reduction (joint PCA, n=256), and domain alignment (CORAL). The aligned features are classified via logistic regression for binary real/fake prediction. Arrow connections show the data flow from datasets through preprocessing stages to final predictions.
  • Figure 2: CORAL Domain Alignment Visualization. Top: Pre-alignment feature distributions show ASVspoof (blue) and FoR (orange) datasets with a large distributional gap. Bottom: Post-CORAL alignment ($\lambda = 10^{-6}$) reduces the inter-domain gap through covariance matching, creating overlapping feature spaces. Performance gains: Acc +7.0%, AUC +5.8%, EER -5.7%.
  • Figure 3: Baseline vs. Final Cross-Domain Performance. Striped bars represent baseline performance using raw Wav2Vec 2.0 features, while solid bars show results after applying power transform, feature selection (ANOVA), PCA reduction (n=256), and CORAL alignment. The final pipeline achieves consistent improvements across Accuracy, AUC, and EER metrics for both ASVspoof$\rightarrow$FoR and FoR$\rightarrow$ASVspoof transfer scenarios, with accuracy gains exceeding 10% in both directions.
  • Figure 4: Proposed Multimodal Architecture for Future Work. Audio (Wav2Vec 2.0) and video (ResNet-50) feature extraction branches would process inputs independently through power transform, feature selection (ANOVA), PCA reduction, and CORAL domain alignment. This is a hypothetical design for future implementation.
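
For reference, the covariance matching summarized in the Figure 2 caption is commonly written as below. This is the standard CORAL whitening and re-coloring form, given here as an assumption: the paper's own two equations are not reproduced on this page, the symbols $X_S$, $X_T$, $C_S$, $C_T$ are illustrative, and $\lambda$ is assumed to act as a small ridge term that keeps the covariance estimates invertible.

```latex
C_S = \operatorname{cov}(X_S) + \lambda I, \qquad
C_T = \operatorname{cov}(X_T) + \lambda I, \qquad
\hat{X}_S = X_S \, C_S^{-1/2} \, C_T^{1/2}
```

Under this form, the aligned source features $\hat{X}_S$ have covariance approximately equal to that of the target, which is what the overlapping distributions in the bottom panel of Figure 2 depict.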