Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection

Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi

TL;DR

This work addresses the challenge of generalizing fake audio detection to unseen spoofing techniques by introducing Genuine-Focused Learning for fake audio detection (GFL-FAD). The framework uses a Mask AutoEncoder (MAE) encoder-decoder to build a Counterfactual Reasoning Enhanced Representation (CRER) from reconstructed audio. A Genuine Audio Reconstruction (GAR) loss, computed only when the input is genuine, keeps learning concentrated on genuine patterns, and MAE-derived Bottleneck (BN) features are fused with the CRER via attention. The approach achieves state-of-the-art performance on ASVspoof2019 LA, reporting an equal error rate (EER) of 0.25%, and ablations highlight the individual contributions of the GAR loss, the BN features, and the CRER. The work advances practical anti-spoofing by improving generalization to novel spoofing methods and suggests future extensions to partially spoofed speech and broader speech-security contexts.

Abstract

The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework, called GFL-FAD, that aims for highly generalized FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation (CRER) based on audio reconstruction using the Mask AutoEncoder (MAE) architecture to accurately model genuine audio features. To reduce the influence of spoofed audio during training, we introduce a genuine audio reconstruction loss, maintaining the focus on learning genuine data features. In addition, content-related bottleneck (BN) features are extracted from the MAE to supplement the knowledge of the original audio. These BN features are adaptively fused with CRER to further improve robustness. Our method achieves state-of-the-art performance with an EER of 0.25% on ASVspoof2019 LA.
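
As a rough illustration of the genuine audio reconstruction loss described in the abstract, the sketch below averages a reconstruction error over only the genuine items of a batch, so spoofed audio contributes nothing to this term. The mean-squared-error form, the label convention (1 = genuine), and the function names are assumptions made for illustration; the abstract does not specify the exact loss.

```python
from typing import List, Sequence

GENUINE = 1  # assumed label convention: 1 = genuine, 0 = spoofed


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def genuine_reconstruction_loss(
    originals: List[Sequence[float]],
    reconstructions: List[Sequence[float]],
    labels: List[int],
) -> float:
    """Average reconstruction error over the genuine items of a batch.

    Spoofed items are skipped, so they add nothing to this loss term.
    Returns 0.0 when the batch contains no genuine audio.
    """
    errors = [
        mse(x, x_hat)
        for x, x_hat, y in zip(originals, reconstructions, labels)
        if y == GENUINE
    ]
    return sum(errors) / len(errors) if errors else 0.0


if __name__ == "__main__":
    originals = [[1.0, 2.0], [0.0, 0.0]]        # toy "spectrogram" features
    reconstructions = [[1.0, 4.0], [5.0, 5.0]]  # toy decoder outputs
    labels = [GENUINE, 0]                       # second item is spoofed
    print(genuine_reconstruction_loss(originals, reconstructions, labels))
```

In a full training setup this term would presumably be added to the classification loss; how the two are weighted is not stated in the abstract.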

Paper Structure

This paper contains 14 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of the overall architecture of the proposed GFL-FAD. We start by segmenting the spectral representation into patches. A portion of these patches is encoded by the MAE encoder, resulting in BN features. The remaining patches are masked and concatenated with the BN features before being fed to the MAE decoder for spectrogram reconstruction. We refer to the reconstructed spectral features as CRER. If the input audio is genuine, an additional GAR loss is computed to maintain focus on modeling genuine audio knowledge. Through fusion, the CRER is combined with the BN features and passed to the back-end classification network for final detection.
  • Figure 2: Visualization of high-dimensional representations of genuine and spoofed audio samples using t-SNE
  • Figure 3: Performance of GFL-FAD at Different Mask Ratios
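
The patch masking step described in the Figure 1 caption, and the mask ratio varied in Figure 3, can be sketched as follows. This is a minimal sketch under stated assumptions: uniform random masking, a single shared mask token, and the helper names `split_patches` and `build_decoder_input` are all hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def split_patches(
    num_patches: int, mask_ratio: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly split patch indices into (visible, masked) index lists.

    Uniform random masking is an assumption; visible patches go to the
    encoder, masked ones are only seen by the decoder as mask tokens.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def build_decoder_input(
    visible: List[int],
    masked: List[int],
    encoded: List[Sequence[float]],
    mask_token: Sequence[float],
) -> List[Sequence[float]]:
    """Put encoded (BN) features and mask tokens back in patch order."""
    slots = {idx: feat for idx, feat in zip(visible, encoded)}
    for idx in masked:
        slots[idx] = mask_token
    return [slots[i] for i in range(len(slots))]


if __name__ == "__main__":
    visible, masked = split_patches(num_patches=16, mask_ratio=0.75)
    # Stand-in for encoder outputs: one toy feature vector per visible patch.
    encoded = [[float(i)] for i in visible]
    sequence = build_decoder_input(visible, masked, encoded, mask_token=[-1.0])
    print(len(visible), len(masked), len(sequence))
```

Changing `mask_ratio` changes how many patches the encoder sees, which is the quantity Figure 3 relates to detection performance.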