Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification

Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie

TL;DR

This work addresses speaker verification (SV) under low-data-resource conditions by adapting Whisper, a large multilingual speech foundation model, with a lightweight adaptor framework called Whisper-SV. Whisper-SV comprises four modules: a pre-trained Whisper module, a representation selection module that picks the top-k Whisper encoder layers richest in speaker information, a multi-layer aggregation module that fuses those layers into a compact, discriminative representation, and a simple speaker classifier trained with AAM-softmax loss. On VoxCeleb1, FFSVC, and IMSV, Whisper-SV reaches EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, outperforming baselines and SSL/domain adaptation approaches with a small parameter footprint and reasonable compute. The results indicate that Whisper representations carry rich speaker cues that can be extracted effectively for SV in low-resource domains, and they suggest a path for future efficiency-focused improvements.

Abstract

Trained on 680,000 hours of speech data, Whisper is a multitask, multilingual speech foundation model that demonstrates superior performance in automatic speech recognition, translation, and language identification. However, its applicability to speaker verification (SV) remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose Whisper-SV, a lightweight adaptor framework that boosts SV with Whisper. Because Whisper is not specifically optimized for SV, we introduce a representation selection module that quantifies the speaker-specific characteristics contained in each layer of Whisper and selects the top-k layers with the most discriminative speaker features. To aggregate pivotal speaker-related features while reducing non-speaker redundancy across the selected top-k layers, we design a multi-layer aggregation module that integrates the multi-layer representations into a single, compact representation for SV. In this module, convolutional layers with shortcut connections among different layers refine the speaker characteristics derived from Whisper's multi-layer representations. In addition, an attention aggregation layer reduces non-speaker interference and amplifies speaker-specific cues. Finally, a simple classification module performs speaker classification. Experiments on the VoxCeleb1, FFSVC, and IMSV datasets show that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, demonstrating superior performance in low-data-resource SV scenarios.
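
For readers less familiar with the EER figures quoted above: the equal error rate is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal Python sketch of that definition, not the paper's evaluation code. The function name `equal_error_rate` and the toy scores are illustrative assumptions, and minDCF is left out because its cost parameters are not given in this section.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: similarity scores of same-speaker trials.
    nontarget_scores: similarity scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy scores (made up for illustration only).
toy_targets = [0.9, 0.8, 0.7, 0.3]
toy_nontargets = [0.6, 0.4, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

The sweep is quadratic in the number of trials and is kept only for readability; on real trial lists a sorted, cumulative-count implementation would be the usual choice.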

Paper Structure (25 sections, 10 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: The architecture of Whisper-SV includes four modules: (a) a pre-trained Whisper module for providing robust and generalized representations, (b) a representation selection module for selecting the top-k layers containing significant speaker-specific characteristics, (c) a multi-layer aggregation module to aggregate representations from multiple layers of Whisper, and (d) a speaker classifier module for speaker classification.
  • Figure 2: Experimental results of ECAPA-TDNN trained with representations extracted from each layer of Whisper. The red dots mark the four layers with the lowest combined score, (EER + 10 × minDCF) / 2.
  • Figure 3: Experimental results of ECAPA-TDNN and Whisper-SV trained with different proportions of the training data. The values in parentheses on the x-axis give the corresponding training duration in hours.
  • Figure 4: Training and validation loss during training on FFSVC.
  • Figure 5: Embedding visualization of different SV models.
  • ...and 1 more figure
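
The combined score in the Figure 2 caption, (EER + 10 × minDCF) / 2, gives a single number per layer that can be used for ranking. Below is a minimal Python sketch, assuming the selection step simply keeps the k layers with the lowest combined score. The function names and the per-layer numbers are hypothetical placeholders and are not results from the paper.

```python
from typing import Dict, List, Tuple


def combined_score(eer_percent: float, min_dcf: float) -> float:
    """Combined metric from the Figure 2 caption: (EER + 10 * minDCF) / 2."""
    return (eer_percent + 10.0 * min_dcf) / 2.0


def select_top_k_layers(per_layer: Dict[int, Tuple[float, float]],
                        k: int = 4) -> List[int]:
    """Return the indices of the k layers with the lowest combined score.

    per_layer maps a layer index to (EER in percent, minDCF), measured by
    training a probe SV model on that layer's representations.
    """
    ranked = sorted(per_layer, key=lambda layer: combined_score(*per_layer[layer]))
    # Report the selected layers in ascending layer order.
    return sorted(ranked[:k])


# Hypothetical per-layer results, invented purely to exercise the function.
hypothetical_results = {
    0: (10.0, 0.9),
    1: (5.0, 0.5),
    2: (4.0, 0.4),
    3: (6.0, 0.6),
    4: (3.0, 0.3),
}
selected = select_top_k_layers(hypothetical_results, k=2)
```

Scaling minDCF by 10 brings it to roughly the same numeric range as EER in percent, so neither metric dominates the ranking.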