Generalized Fake Audio Detection via Deep Stable Learning

Zhiyong Wang; Ruibo Fu; Zhengqi Wen; Yuankun Xie; Yukun Liu; Xiaopeng Wang; Xuefei Liu; Yongwei Li; Jianhua Tao; Yi Lu; Xin Qi; Shuchen Shi

TL;DR

This work tackles distribution shift in fake audio detection (FAD) by introducing SWL, a stable-learning training scheme that learns sample weights to decorrelate selected features mapped through Random Fourier Features (RFF), without requiring extra data. The SWL module plugs into existing FAD models and iteratively optimizes a weighted loss together with a decorrelation objective $I_{AB} = \| \hat{\Sigma}_{AB} \|_F^2$ across RFF mappings; global weights are approximated through a memory-augmented fusion strategy using $Z'_{G_i}$ and $w'_{G_i}$ with $\alpha = 0.9$. Trained on the ASVspoof 2019 LA training data and tested on three distribution-shifted sets from ASVspoof 2021 (LA/DF), SWL improves generalization for multiple base models (AASIST, RawNet2, TSSD). Analysis shows that spectral features (2N-S) and larger numbers of RFF mappings yield stronger cross-distribution robustness. The approach offers a practical plug-in for robust FAD deployment in diverse real-world conditions.
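The decorrelation objective and the fusion step named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the RFF form $h(x) = \sqrt{2}\cos(\omega x + \phi)$, the weighted cross-covariance estimator, and the moving-average form of the fusion follow the usual stable-learning recipe and are not spelled out in this summary. The function names (`rff_map`, `weighted_cross_cov`, `decorrelation_loss`, `fuse_global`) are hypothetical.

```python
import numpy as np


def rff_map(x, n_funcs, rng):
    """Map one feature column x (shape [n]) to n_funcs random Fourier features.

    Assumed form: h(x) = sqrt(2) * cos(w * x + phi),
    with w ~ N(0, 1) and phi ~ U(0, 2*pi).
    """
    w = rng.standard_normal(n_funcs)
    phi = rng.uniform(0.0, 2.0 * np.pi, n_funcs)
    return np.sqrt(2.0) * np.cos(np.outer(x, w) + phi)  # shape [n, n_funcs]


def weighted_cross_cov(u, v, weights):
    """Weighted cross-covariance between two sets of mapped features."""
    n = u.shape[0]
    wu = weights[:, None] * u
    wv = weights[:, None] * v
    wu = wu - wu.mean(axis=0, keepdims=True)
    wv = wv - wv.mean(axis=0, keepdims=True)
    return wu.T @ wv / (n - 1)  # shape [n_funcs, n_funcs]


def decorrelation_loss(features, weights, n_funcs=5, seed=0):
    """Sum of squared Frobenius norms ||Sigma_AB||_F^2 over all feature pairs."""
    rng = np.random.default_rng(seed)
    mapped = [rff_map(features[:, j], n_funcs, rng)
              for j in range(features.shape[1])]
    loss = 0.0
    for a in range(len(mapped)):
        for b in range(a + 1, len(mapped)):
            cov = weighted_cross_cov(mapped[a], mapped[b], weights)
            loss += np.sum(cov ** 2)
    return float(loss)


def fuse_global(z_global, w_global, z_local, w_local, alpha=0.9):
    """Assumed moving-average fusion of saved global and current batch info."""
    z_new = alpha * z_global + (1.0 - alpha) * z_local
    w_new = alpha * w_global + (1.0 - alpha) * w_local
    return z_new, w_new
```

In a training loop, the sample weights would be the variables being optimized to minimize `decorrelation_loss`, while the model parameters are updated on the weighted classification loss; an autodiff framework would replace NumPy for that step.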

Abstract

Although current fake audio detection approaches have achieved remarkable success on specific datasets, they often fail when evaluated with datasets from different distributions. Previous studies typically address distribution shift by focusing on using extra data or applying extra loss restrictions during training. However, these methods either require a substantial amount of data or complicate the training process. In this work, we propose a stable learning-based training scheme that involves a Sample Weight Learning (SWL) module, addressing distribution shift by decorrelating all selected features via learning weights from training samples. The proposed portable plug-in-like SWL is easy to apply to multiple base models and generalizes them without using extra data during training. Experiments conducted on the ASVspoof datasets clearly demonstrate the effectiveness of SWL in generalizing different models across three evaluation datasets from different distributions.

Paper Structure (13 sections, 8 equations, 2 figures, 3 tables)

Introduction
Proposed Method
Sample weighting with RFF
Iteratively learn global sample weights
Experiments
Datasets and Evaluation metrics
Experimental setup
Results and Analyses
SWL generalizes base FAD models
More RFF mapping functions, higher generalization
What combination of nodes is better for decorrelation
Conclusions
Acknowledgements

Figures (2)

Figure 1: The overall architecture of the proposed stable-learning-based method. LSWD refers to Learning Sample Weighting for Decorrelation, as described in the section "Sample weighting with RFF". The number of RFF mapping functions and the choice of hidden-state feature are adjustable. In the training stage, only the selected hidden-state feature is fed into the SWL module, and the computed sample weights are multiplied with the Weighted Cross-Entropy (WCE) loss. In the inference phase, the model predicts directly, without computing sample weights.
Figure 2: Applying SWL to AASIST-L using different numbers of RFF mapping functions. The dashed line represents the performance of AASIST-L without applying SWL.
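
The training-time step in the Figure 1 caption, multiplying learned sample weights into the WCE loss, can be illustrated as follows. This is a hedged sketch rather than the paper's code: the function name `sample_weighted_wce`, the mean reduction, and the class-weight values used in the example are assumptions chosen for illustration.

```python
import numpy as np


def sample_weighted_wce(logits, labels, sample_weights, class_weights):
    """Class-weighted cross-entropy, scaled per sample by learned weights.

    logits: [n, n_classes], labels: [n] integer class ids,
    sample_weights: [n], class_weights: [n_classes].
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the true class for each sample.
    nll = -log_probs[np.arange(len(labels)), labels]
    # Class weighting (the "W" in WCE), then per-sample SWL weighting.
    per_sample = class_weights[labels] * nll
    return float(np.mean(sample_weights * per_sample))


# Example with hypothetical values: two classes (spoof / bona fide).
logits = np.array([[2.0, -1.0], [0.5, 0.5]])
labels = np.array([0, 1])
loss = sample_weighted_wce(
    logits, labels,
    sample_weights=np.array([1.2, 0.8]),  # would come from the SWL module
    class_weights=np.array([0.5, 0.5]),   # illustrative values only
)
```

At inference time no weights are needed, consistent with the caption: the model's logits are used directly.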
