SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

Pengcheng Guo, Xuankai Chang, Hang Lv, Shinji Watanabe, Lei Xie

TL;DR

This work tackles target-speaker ASR (TS-ASR) on overlapped speech by adapting the Whisper foundation model with a Speaker-Querying mechanism (SQ-Whisper). It introduces the SQ-Former, which generates target-speaker prompts from enrollment and mixture speech, and pairs it with a speaker contrastive loss, flexible prompt injection schemes, and parameter-efficient fine-tuning. Results on Libri2Mix, WSJ0-2Mix, and AMI show substantial WER reductions, including state-of-the-art WERs of $14.6\%$ on Libri2Mix Test and $4.4\%$ on WSJ0-2Mix Test with data augmentation, and consistent gains on the real-world AMI meeting data. The framework offers a path for extending supervised foundation models to TS-ASR and may apply to other multi-speaker and multimodal scenarios.

Abstract

Benefiting from massive and diverse data sources, speech foundation models exhibit strong generalization and knowledge transfer capabilities to a wide range of downstream tasks. However, a limitation arises from their exclusive handling of single-speaker speech input, making them ineffective in recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we delve into the adaptation of speech foundation models to eliminate interfering speakers from overlapping speech and perform target-speaker automatic speech recognition (TS-ASR). Initially, we utilize the Whisper model as the foundation for adaptation and conduct a thorough comparison of its integration with existing target-speaker adaptation techniques. We then propose an innovative model termed Speaker-Querying Whisper (SQ-Whisper), which employs a set number of trainable queries to capture speaker prompts from overlapping speech based on target-speaker enrollment. These prompts serve to steer the model in extracting speaker-specific features and accurately recognizing target-speaker transcriptions. Experimental results demonstrate that our approach effectively adapts the pre-trained speech foundation model to TS-ASR. Compared with the robust TS-HuBERT model, the proposed SQ-Whisper significantly improves performance, yielding up to 15% and 10% relative reductions in word error rates (WERs) on the Libri2Mix and WSJ0-2Mix datasets, respectively. With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix Test set and 4.4% on the WSJ0-2Mix Test set. Furthermore, we evaluate our model on the real-world AMI meeting dataset, which shows consistent improvement over other adaptation methods.
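The abstract describes a set number of trainable queries that capture speaker prompts from speech. The general idea of query-based pooling can be sketched as below. This is only an illustration under stated assumptions, not the authors' SQ-Former: it uses a single attention head with no learned projections, random vectors in place of trained queries and real speech features, and hypothetical sizes (`NUM_QUERIES`, `DIM`) and names (`query_pool`) that do not come from the paper. The point it shows is that a fixed number of queries yields a fixed-size prompt regardless of how many input frames there are.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_pool(queries, frames):
    """Cross-attend each query over the frames (single head, no projections).

    queries: K vectors of size D (these would be trainable in a real model).
    frames:  T feature vectors of size D (T may differ per utterance).
    Returns K vectors of size D, independent of T.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    prompts = []
    for q in queries:
        # Attention weights of this query over all frames.
        weights = softmax([dot(q, f) * scale for f in frames])
        # Weighted average of the frames gives one prompt vector.
        prompts.append(
            [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        )
    return prompts


rng = random.Random(0)
NUM_QUERIES, DIM = 4, 8  # hypothetical sizes, not taken from the paper
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
short_utt = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]
long_utt = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]

prompts_short = query_pool(queries, short_utt)
prompts_long = query_pool(queries, long_utt)
print(len(prompts_short), len(prompts_long))  # same prompt count for both inputs
```

In a trained system the queries and any projection layers would be learned, and the abstract indicates that the prompts are conditioned on the target-speaker enrollment; how that conditioning is wired is not specified here, so the sketch leaves it out.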

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of the TSE-Whisper model. "TSE" is the target-speaker extraction module that extracts target-speaker features.
  • Figure 2: Overview of the proposed SQ-Whisper model. SQ-Former indicates the Speaker-Querying Transformer module for learning target-speaker prompts. $+$ and $-$ refer to positive and negative sample pairs, respectively, which are used to compute the speaker contrastive loss $\mathcal{L}_{\text{Contrastive}}$.
  • Figure 3: Details of the proposed SQ-Former module. Here, $N\times$ denotes a stack of $N$ blocks.
  • Figure 4: Word error rates (WERs, %) of our proposed SQ-Whisper with and without the speaker contrastive loss.
  • Figure 5: t-SNE visualization of the learned speaker prompts with and without the speaker contrastive loss. The legend gives the speaker ID. The prompts are noticeably more distinct when the speaker contrastive loss is used.
  • ...and 1 more figures
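
The Figure 2 caption mentions positive and negative sample pairs used to compute a speaker contrastive loss. The exact formulation is not given in this summary, so the following is a generic InfoNCE-style sketch, assuming cosine similarity and a hypothetical temperature of 0.1. The function name `contrastive_loss` and the toy vectors are illustrative only and should not be read as the paper's $\mathcal{L}_{\text{Contrastive}}$.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's definition).

    The loss is small when the anchor is closer to the positive sample
    than to every negative sample, and large otherwise.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D "speaker prompts": same-speaker vectors point the same way.
anchor = [1.0, 0.0]
same_speaker = [0.9, 0.1]
other_speaker = [0.0, 1.0]

loss_separated = contrastive_loss(anchor, same_speaker, [other_speaker])
loss_confused = contrastive_loss(anchor, other_speaker, [same_speaker])
print(loss_separated < loss_confused)  # separated prompts give the lower loss
```

Minimizing a loss of this shape pulls prompts of the same speaker together and pushes prompts of different speakers apart, which is the kind of separation the Figure 5 caption reports.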