Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition

Yicong Jiang; Tianzi Wang; Xurong Xie; Juan Liu; Wei Sun; Nan Yan; Hui Chen; Lan Wang; Xunying Liu; Feng Tian

TL;DR

This paper tackles dysarthric speech recognition under data scarcity and large speaker variability by introducing Perceiver-Prompt, which combines LoRA fine-tuning of Whisper with a Perceiver-based prompt encoder that generates fixed-length speaker prompts from history utterances via P-Tuning. The prompts are concatenated with Whisper inputs to adaptively model speaker-specific articulation in Chinese dysarthric speech. Experiments on a Chinese dysarthric dataset show consistent, sometimes substantial, improvements in character, word, and sentence recognition, with up to 13.04% relative CER reduction over a fine-tuned Whisper baseline and up to 51.38% relative reduction on the most severe cases, validating the method’s flexibility across configurations and auxiliary supervision. The work demonstrates the feasibility and impact of applying P-Tuning to a large-scale speech model for speaker adaptation in disordered speech, offering a scalable approach for real-world deployment where labeled data per speaker are limited.
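
For readers unfamiliar with the metric, the relative reductions quoted above are presumably computed in the usual way, as the drop in character error rate (CER) divided by the baseline CER. The summary does not spell out the formula, so the expression below is the standard definition and an assumption here; the subscripts "baseline" (fine-tuned Whisper) and "adapted" (Perceiver-Prompt) are labels introduced only for illustration.

$$
\text{relative CER reduction} = \frac{\mathrm{CER}_{\text{baseline}} - \mathrm{CER}_{\text{adapted}}}{\mathrm{CER}_{\text{baseline}}} \times 100\%
$$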

Abstract

Disordered speech recognition has profound implications for improving the quality of life of individuals afflicted with, for example, dysarthria. Dysarthric speech recognition faces several challenges: limited data, substantial dissimilarities between dysarthric and non-dysarthric speakers, and significant speaker variation stemming from the disorder. This paper introduces Perceiver-Prompt, a method for speaker adaptation that applies P-Tuning to the large-scale Whisper model. We first fine-tune Whisper using LoRA and then integrate a trainable Perceiver that generates fixed-length speaker prompts from variable-length inputs, improving the model's recognition of Chinese dysarthric speech. Experimental results on our Chinese dysarthric speech dataset demonstrate consistent improvements in recognition performance with Perceiver-Prompt. A relative CER reduction of up to 13.04% is obtained over the fine-tuned Whisper.
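
The central mechanism described above, turning variable-length inputs into a fixed-length prompt, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single cross-attention step with one head and no learned key/value projections, and the names `cross_attend` and `make_prompted_input` are hypothetical. The point is only that a fixed set of latent queries attending over any number of history frames always yields the same number of prompt vectors, which can then be prepended to the current utterance's features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, frames):
    """Each latent query attends over all frames (scaled dot-product).

    Simplifying assumption: keys and values are the raw frames, with no
    learned projections. The output has len(latents) rows regardless of
    how many frames are supplied.
    """
    dim = len(latents[0])
    prompt = []
    for query in latents:
        scores = [
            sum(q * k for q, k in zip(query, frame)) / math.sqrt(dim)
            for frame in frames
        ]
        weights = softmax(scores)
        # Weighted average of the frames, one value per feature dimension.
        prompt.append(
            [sum(w * frame[d] for w, frame in zip(weights, frames))
             for d in range(dim)]
        )
    return prompt


def make_prompted_input(latents, history_frames, current_frames):
    """Prepend the fixed-length speaker prompt to the current features."""
    return cross_attend(latents, history_frames) + current_frames
```

In a trained system the latent queries and the attention projections would be learned parameters; here they are plain lists so the shape behaviour is easy to check by hand.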

Paper Structure (14 sections, 2 equations, 2 figures, 3 tables)

This paper contains 14 sections, 2 equations, 2 figures, 3 tables.

Introduction
Preliminary
Large-scale pre-trained model Whisper and LoRA
P-tuning
Perceiver-Prompt
Employing Perceiver as a Prompt Encoder
Perceiver-Prompt for speaker adaptation
Experiments
Experimental Setup
General Result Analysis
Different Configuration
Joint training with additional information
Conclusion
Acknowledgement

Figures (2)

Figure 1: The flexible concatenation method of Perceiver-Prompt (optionally including data from the same speaker).
Figure 2: The three subplots (a), (b), and (c) respectively represent the t-SNE clustering results of Conf.14, Conf.2, and Conf.8. The left column represents t-SNE analysis for speakers, with different colors indicating different speakers. The right column represents t-SNE analysis for FDA severity levels, with different colors indicating different FDA severity levels.
