Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition
Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian
TL;DR
This paper tackles dysarthric speech recognition under data scarcity and large speaker variability by introducing Perceiver-Prompt, which combines LoRA fine-tuning of Whisper with a Perceiver-based prompt encoder that generates fixed-length speaker prompts from history utterances via P-Tuning. The prompts are concatenated with Whisper inputs to adaptively model speaker-specific articulation in Chinese dysarthric speech. Experiments on a Chinese dysarthric dataset show consistent, sometimes substantial, improvements in character, word, and sentence recognition, with up to 13.04% relative CER reduction over a fine-tuned Whisper baseline and up to 51.38% relative reduction on the most severe cases, validating the method’s flexibility across configurations and auxiliary supervision. The work demonstrates the feasibility and impact of applying P-Tuning to a large-scale speech model for speaker adaptation in disordered speech, offering a scalable approach for real-world deployment where labeled data per speaker are limited.
Abstract
Disordered speech recognition has profound implications for improving the quality of life of individuals affected by conditions such as dysarthria. Dysarthric speech recognition faces several challenges: limited data, substantial differences between dysarthric and non-dysarthric speakers, and significant speaker variation stemming from the disorder. This paper introduces Perceiver-Prompt, a speaker adaptation method that applies P-Tuning to the large-scale Whisper model. We first fine-tune Whisper using LoRA and then integrate a trainable Perceiver that generates fixed-length speaker prompts from variable-length inputs, improving the model's recognition of Chinese dysarthric speech. Experimental results on our Chinese dysarthric speech dataset demonstrate consistent improvements in recognition performance with Perceiver-Prompt. A relative reduction of up to 13.04% in character error rate (CER) is obtained over the fine-tuned Whisper baseline.
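To make the "fixed-length speaker prompts from variable-length inputs" idea concrete, the following is a minimal, dependency-free sketch of Perceiver-style cross-attention, in which a small set of latent vectors attends over however many input frames are available. It is an illustration under stated assumptions, not the paper's implementation: the class name `PromptResampler`, the single attention head, the dimensions, the random (untrained) weights, and the choice to prepend the prompts to the input sequence are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned parameters or features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class PromptResampler:
    """Hypothetical single-head cross-attention from latents to input frames.

    The number of latents fixes the number of output prompts, so the output
    length does not depend on how many frames the input contains.
    """

    def __init__(self, num_prompts, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # In a trained model these would be learned; here they are random.
        self.latents = random_matrix(num_prompts, dim, rng)
        self.w_q = random_matrix(dim, dim, rng)
        self.w_k = random_matrix(dim, dim, rng)
        self.w_v = random_matrix(dim, dim, rng)

    def __call__(self, frames):
        q = matmul(self.latents, self.w_q)  # (num_prompts, dim)
        k = matmul(frames, self.w_k)        # (num_frames, dim)
        v = matmul(frames, self.w_v)        # (num_frames, dim)
        k_t = [list(col) for col in zip(*k)]  # (dim, num_frames)
        scale = 1.0 / math.sqrt(self.dim)
        scores = matmul(q, k_t)             # (num_prompts, num_frames)
        weights = [softmax([s * scale for s in row]) for row in scores]
        return matmul(weights, v)           # (num_prompts, dim)


rng = random.Random(1)
resampler = PromptResampler(num_prompts=4, dim=8)

# Two "history utterances" of different lengths (random stand-in features).
short_frames = random_matrix(5, 8, rng)
long_frames = random_matrix(37, 8, rng)

prompts_short = resampler(short_frames)
prompts_long = resampler(long_frames)

# Assumed concatenation scheme: prompts are prepended to the input frames.
model_input = prompts_long + long_frames
print(len(prompts_short), len(prompts_long), len(model_input))
```

Both utterances yield the same number of prompt vectors, which is what allows the prompts to be concatenated with the model input regardless of how much history audio a speaker has.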
