Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation guided Structured Pruning
Ze Li, Ming Cheng, Ming Li
TL;DR
This work leverages the large-scale self-supervised encoder w2v-BERT 2.0 to advance speaker verification (SV). By combining MFA-based multi-layer feature aggregation with a Layer Adapter and LoRA for efficient fine-tuning, the approach achieves state-of-the-art equal error rates (EERs) of $0.12\%$ on Vox1-O and $0.55\%$ on Vox1-H. A knowledge-distillation-guided structured pruning framework reduces the model size by $80\%$ at a cost of only $0.04\%$ EER degradation, making the model more practical to deploy. The method is validated on the VoxCeleb1/2, VoxBlink2, and CN-Celeb datasets, and the released source code supports replication and further research on scalable SV systems.
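The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be approximated from trial scores; it is not the authors' evaluation code. The function name `equal_error_rate` and the toy scores are illustrative assumptions, and the threshold sweep is a simple approximation rather than an interpolated ROC computation.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    Returns the average of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one target and one nontarget trial overlap.
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.5])
print(f"EER = {100 * toy_eer:.2f}%")
```

On real benchmarks such as Vox1-O, the trial lists contain tens of thousands of pairs, so a vectorized or ROC-based implementation would normally be used instead of this quadratic sweep.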
Abstract
Large-scale self-supervised Pre-Trained Models (PTMs) have significantly improved speaker verification (SV) by providing rich feature representations. In this paper, we apply w2v-BERT 2.0, a model with approximately 600 million parameters trained on 4.5 million hours of unlabeled data across 143 languages, to the SV task. An MFA structure with a Layer Adapter processes the multi-layer feature outputs of the PTM and extracts speaker embeddings. Additionally, we incorporate LoRA for efficient fine-tuning. Our model achieves state-of-the-art results, with equal error rates (EERs) of 0.12% and 0.55% on the Vox1-O and Vox1-H test sets, respectively. Furthermore, we apply knowledge distillation guided structured pruning, reducing the model size by 80% with only a 0.04% EER degradation. Source code and models are released at https://github.com/ZXHY-82/w2v-BERT-2.0_SV.
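To illustrate why LoRA makes fine-tuning a large PTM efficient, the sketch below shows the general low-rank adaptation idea on a single linear layer: the pre-trained weight stays frozen and only two small matrices are trained. It is a dependency-free illustration, not the released implementation. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialization values are all assumptions made for this example; the abstract does not state which layers of w2v-BERT 2.0 are adapted or with what rank.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Linear layer y = W x + (alpha / rank) * B (A x) with W kept frozen."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen pre-trained weight, shape (d_out, d_in)
        d_out, d_in = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A gets a small random init; B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B would receive gradients during fine-tuning.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Hypothetical 4x4 frozen weight adapted with rank 1.
frozen = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 3.0, 0.0],
          [0.0, 0.0, 0.0, 4.0]]
layer = LoRALinear(frozen, rank=1)
print(layer.forward([1.0, 1.0, 1.0, 1.0]))  # equals the frozen output at init
print(layer.trainable_params(), "trainable vs", 4 * 4, "frozen parameters")
```

For a square layer of width d, the trainable count is 2·d·rank instead of d², which is where the savings come from when d is in the thousands and the rank is small.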
