Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models

Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

TL;DR

This work investigates improving speaker verification (SV) by leveraging a pre-trained Whisper ASR encoder. It introduces Whisper-PMFA, which performs partial multi-scale feature aggregation over a selected subset of Whisper encoder blocks, followed by attentive pooling to obtain robust speaker embeddings, and explores LoRA-based parameter-efficient adaptation to reduce training costs. The approach achieves EERs of 1.42% on VoxCeleb1-O and 8.23% on CN-Celeb1, improving on the ECAPA-TDNN and ResNet34 baselines on both datasets, while LoRA reduces trainable parameters by about 45× at the cost of a 0.2% EER increase. These results highlight the viability of using large pre-trained ASR representations for SV tasks and the practicality of parameter-efficient fine-tuning in multilingual settings.

Abstract

In this paper, Whisper, a large-scale pre-trained model for automatic speech recognition, is applied to speaker verification. A partial multi-scale feature aggregation (PMFA) approach based on a subset of Whisper encoder blocks is proposed to derive highly discriminative speaker embeddings. Experimental results demonstrate that the middle to later blocks of the Whisper encoder retain more speaker information. On the VoxCeleb1 and CN-Celeb1 datasets, our system achieves equal error rates (EERs) of 1.42% and 8.23% respectively, corresponding to absolute EER reductions of 0.58% and 1.81% over the ECAPA-TDNN baseline, and 0.46% and 0.97% over the ResNet34 baseline. Furthermore, our results indicate that using Whisper models trained on multilingual data can effectively enhance the model's robustness across languages. Finally, low-rank adaptation (LoRA) is evaluated; it reduces the number of trainable model parameters by approximately 45 times while increasing EER by only 0.2%.
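
The aggregation idea can be illustrated with a minimal sketch. It assumes that "aggregation" means concatenating, frame by frame, the outputs of a contiguous range of encoder blocks; the summary above does not spell this out. The names `partial_aggregate`, `mean_pool`, `start`, and `end` are hypothetical, block indices are assumed to be 1-based and inclusive, and plain mean pooling stands in for the attentive pooling the paper describes.

```python
from typing import List

Frames = List[List[float]]  # one block's output, indexed as [time][feature]


def partial_aggregate(block_outputs: List[Frames], start: int, end: int) -> Frames:
    """Concatenate the features of blocks start..end (1-based, inclusive) per frame.

    Assumption: aggregation is a feature-wise concatenation of the selected blocks.
    """
    if not 1 <= start <= end <= len(block_outputs):
        raise ValueError("start and end must satisfy 1 <= start <= end <= number of blocks")
    selected = block_outputs[start - 1:end]
    num_frames = len(selected[0])
    return [
        [value for block in selected for value in block[t]]
        for t in range(num_frames)
    ]


def mean_pool(frames: Frames) -> List[float]:
    """Average over time. A simple stand-in for attentive pooling, not the paper's layer."""
    num_frames = len(frames)
    return [sum(column) / num_frames for column in zip(*frames)]


# Toy input: 4 blocks, 3 frames each, 2 features per frame.
block_outputs = [[[b + 0.0, b + 0.5] for _ in range(3)] for b in range(1, 5)]
aggregated = partial_aggregate(block_outputs, start=2, end=4)
embedding = mean_pool(aggregated)
```

Selecting blocks 2 to 4 of the toy input yields 6 features per frame (3 blocks × 2 features), and pooling over time collapses the 3 frames into a single fixed-length vector.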

Paper Structure (18 sections, 2 equations, 2 figures, 4 tables)

This paper contains 18 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The overall architecture of Whisper-PMFA, where $S$ denotes the index of the initial Whisper block selected for feature aggregation, and $E$ represents the index of the final Whisper block selected.
  • Figure 2: LoRA. The pretrained weight parameters $W$ are frozen, with only $A$ and $B$ being updated.
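
The Figure 2 caption can be made concrete with a small sketch of a LoRA-adapted linear layer. It assumes the standard LoRA formulation $h = Wx + BAx$, with $W$ of shape `d_out × d_in`, $A$ of shape `r × d_in`, and $B$ of shape `d_out × r`; the function names, the `scale` argument, and the layer sizes are illustrative assumptions, not values from the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector x (length cols)."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, scale: float = 1.0) -> Vector:
    """Compute h = W x + scale * B (A x).

    W is the frozen pretrained weight; only A and B would receive gradient updates.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight (identity, for illustration)
A = [[1.0, 1.0]]              # rank-1 down-projection
B_zero = [[0.0], [0.0]]       # B is commonly initialised to zero
B_trained = [[1.0], [2.0]]    # hypothetical values after some training
x = [2.0, 3.0]

h_initial = lora_forward(W, A, B_zero, x)
h_adapted = lora_forward(W, A, B_trained, x)
```

With $B$ at zero the layer reproduces the frozen model exactly, so adaptation starts from the pretrained behaviour. For a hypothetical 1024 × 1024 weight and rank 8, `lora_trainable_params` gives 16,384 trainable values against 1,048,576 in the full matrix; these sizes are illustrative and are not meant to reproduce the paper's roughly 45× figure.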