Extending Whisper with prompt tuning to target-speaker ASR

Hao Ma; Zhiyuan Peng; Mingjie Shao; Jing Li; Ju Liu

TL;DR

Transcribing a target speaker from overlapped speech is challenging for single-talker ASR models. The paper develops a parameter-efficient approach that extends Whisper to target-speaker ASR (TS-ASR) using prompt tuning, deep prompting, and reparameterization of the soft prompts. It reaches performance comparable to full-training approaches with task-specific parameters amounting to only about 1% of the model, while preserving Whisper's inverse text normalization and timestamp tagging. The work points to a practical way of adapting large foundation models to multi-talker TS-ASR at low fine-tuning cost.

Abstract

Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from multi-talker overlapped utterances. Most of the existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model, leading to significant training costs and becoming inapplicable to large foundation models. This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. Variants of prompt tuning approaches along with their configurations are explored and optimized for TS-ASR. Experimental results show that prompt tuning can achieve performance comparable to state-of-the-art full training approaches while only requiring about 1% of task-specific model parameters. Notably, the original Whisper's features, such as inverse text normalization and timestamp tagging, are retained in target-speaker ASR, keeping the generated transcriptions natural and informative.
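The general idea of prompt tuning mentioned in the abstract can be illustrated with a minimal sketch: a small set of trainable vectors is placed in front of the input sequence of a frozen model, so only those vectors count as task-specific parameters. This is a generic illustration and not the paper's implementation; the names `SoftPrompt`, `prepend_prompt`, and `trainable_fraction`, as well as all sizes, are hypothetical, and how the target speaker is encoded into the prompt is not shown here.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class SoftPrompt:
    """Trainable prompt vectors: the only task-specific parameters in this sketch."""
    vectors: List[Vector]  # conceptual shape: (prompt_length, hidden_dim)

    def num_parameters(self) -> int:
        # One scalar parameter per element of every prompt vector.
        return sum(len(v) for v in self.vectors)


def prepend_prompt(prompt: SoftPrompt, features: List[Vector]) -> List[Vector]:
    """Place the prompt vectors in front of the (frozen) model's input sequence."""
    if features:
        hidden_dim = len(features[0])
        if any(len(v) != hidden_dim for v in prompt.vectors):
            raise ValueError("prompt and feature dimensions must match")
    return prompt.vectors + features


def trainable_fraction(prompt_params: int, frozen_params: int) -> float:
    """Share of all parameters that are task-specific (i.e. trained)."""
    return prompt_params / (prompt_params + frozen_params)


# Hypothetical toy sizes, chosen only to keep the example readable.
prompt = SoftPrompt(vectors=[[0.0] * 4 for _ in range(3)])  # 3 prompt vectors
features = [[1.0] * 4 for _ in range(5)]                    # 5 input frames
extended = prepend_prompt(prompt, features)                 # 8 vectors in total
```

In a real setup the frozen parameters would be those of the pre-trained model, and only the prompt vectors would receive gradient updates, which is why the task-specific share can stay small.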

Paper Structure (16 sections, 4 equations, 1 figure, 3 tables)

Introduction
Methods
Background of Whisper
Extending Whisper to Target-speaker ASR
Prompt Tuning
Deep Prompting
Reparameterization of Soft Prompts
Experiments
Dataset and Evaluation Metric
Training Configuration
Main Results
Selection of Prompt Length and Reparameterization Method
SOTA Comparison
Ablation Study
Retaining Whisper's Featured Abilities
...and 1 more section

Figures (1)

Figure 1: Overview of proposed prompting framework. Modules with solid-line borders are included in the baseline configuration, while modules with dashed-line borders are optional.
