Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0

Zhiyong Wang; Ruibo Fu; Zhengqi Wen; Jianhua Tao; Xiaopeng Wang; Yuankun Xie; Xin Qi; Shuchen Shi; Yi Lu; Yukun Liu; Chenxing Li; Xuefei Liu; Guanjun Li

TL;DR

This work addresses fake audio detection in the face of advancing speech synthesis without fine-tuning the pretrained backbone. It introduces a Mixture of Experts fusion (MoE fusion) module that keeps wav2vec 2.0 frozen and uses the last-layer representation $F_l^{24}$ to gate $N = n \times 24$ experts, each specializing in one wav2vec 2.0 layer feature $F_l^i$; the fused output is passed to an AASIST classifier. On ASVspoof 2019/2021 and ITW, the method achieves an equal error rate (EER) competitive with fine-tuned baselines and the best result on some evaluation sets, indicating robust generalization. Ablation studies show that, under MoE fusion, freezing the backbone often outperforms fine-tuning, and that adding experts or enlarging the hidden dimension does not universally improve results, suggesting that controlled, sparse, layer-aware fusion offers the best trade-off between performance and training efficiency.
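The gating scheme summarized above can be illustrated with a minimal, standard-library Python sketch. Several things here are assumptions rather than details from the paper: each expert is a single linear projection, the gate is a dense softmax over all $N$ experts, expert $k$ reads layer $\lfloor k / n \rfloor$, and the toy sizes (`FEAT_DIM`, `HIDDEN_DIM`, `EXPERTS_PER_LAYER`) are hypothetical. The frozen wav2vec 2.0 layer features are stood in for by random vectors, and the AASIST classifier is omitted.

```python
import math
import random

NUM_LAYERS = 24          # layer features F_l^1 .. F_l^24 from the summary
EXPERTS_PER_LAYER = 2    # hypothetical value of n
FEAT_DIM = 8             # toy feature size (assumption, not the real model width)
HIDDEN_DIM = 4           # toy expert output size (assumption)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MoEFusion:
    """Illustrative fusion of per-layer features, gated by the last layer."""

    def __init__(self, num_layers, experts_per_layer, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.n = experts_per_layer
        self.hidden_dim = hidden_dim
        self.num_experts = num_layers * experts_per_layer  # N = n x 24
        # Assumption: one linear map per expert; the paper's expert may differ.
        self.experts = [rand_matrix(hidden_dim, feat_dim, rng)
                        for _ in range(self.num_experts)]
        # Gate maps the last-layer feature to one score per expert.
        self.gate = rand_matrix(self.num_experts, feat_dim, rng)

    def gate_weights(self, last_layer_frame):
        # Assumption: dense softmax gate (no top-k selection).
        return softmax(matvec(self.gate, last_layer_frame))

    def fuse_frame(self, layer_frames):
        """layer_frames: one feature vector per layer for a single time step."""
        weights = self.gate_weights(layer_frames[-1])
        fused = [0.0] * self.hidden_dim
        for k, (W, g) in enumerate(zip(self.experts, weights)):
            expert_out = matvec(W, layer_frames[k // self.n])  # expert k reads one layer
            fused = [f + g * o for f, o in zip(fused, expert_out)]
        return fused


# Stand-in for frozen backbone outputs: these are inputs only, never updated.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(NUM_LAYERS)]
fusion = MoEFusion(NUM_LAYERS, EXPERTS_PER_LAYER, FEAT_DIM, HIDDEN_DIM)
weights = fusion.gate_weights(frames[-1])
fused = fusion.fuse_frame(frames)
```

In this sketch only the gate and expert matrices would be trainable, which is the sense in which the backbone stays frozen; a sparse variant would keep only the largest gate weights before summing.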

Abstract

Speech synthesis technology poses a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods use pretrained models, and integrating features from multiple layers of the pretrained model further improves detection performance. However, most previously proposed fusion methods require fine-tuning the pretrained model, which leads to excessively long training times and hinders model iteration when new speech synthesis technology appears. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts that keeps the pretrained model frozen: guided by a gating network driven by the last-layer feature, it extracts and integrates the information relevant to fake audio detection from the layer features. Experiments on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves performance competitive with methods that require fine-tuning.
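The comparisons in the summary are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following sketch shows one simple way to estimate it from detector scores; the function name `compute_eer`, the convention that higher scores mean "bona fide", and the toy scores are assumptions for illustration, not the official ASVspoof scoring tool.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate EER by scanning every observed score as a threshold.

    Assumption: a higher score means the detector believes the audio is bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one spoofed clip (0.6) outranks one bona fide clip (0.4).
toy_eer = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
```

With four clips per class and one overlapping pair, the two error rates meet at one error in four on each side, so lower values indicate better separation of bona fide and fake audio.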

Paper Structure (12 sections, 3 equations, 1 figure, 3 tables)

Introduction
Proposed Method
wav2vec 2.0 Model
Mixture of Experts Fusion
Experiments and Results
Datasets and Metrics
Implementation Details
Results and Analysis
Performance Comparison
Ablation Study
Mixture of Experts Configuration
Conclusion

Figures (1)

Figure 1: The FAD architecture (a), details of the MoE fusion module (b), and details of an expert (c).
