Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning
Chirag Nagpal, Subhashini Venugopalan, Jimmy Tobin, Marilyn Ladewig, Katherine Heller, Katrin Tomanek
TL;DR
This work demonstrates how to repurpose a decoder-only LLM for automatic speech recognition (ASR) by discretizing audio into tokens that replace rare text tokens in the vocabulary. The model is first fine-tuned on mixtures of standard and disordered speech, then further adapted with reinforcement learning using rewards for semantic meaning preservation and syntactic accuracy (word error rate, WER). The RLHF approach yields substantial improvements over supervised fine-tuning in adapting to disordered speech, though it does not surpass specialized ASR systems; the best results occur when meaning preservation is weighted strongly. The study validates an effective, scalable strategy for domain adaptation of LLMs to speech by combining audio-token discretization with multi-objective RLHF, and highlights avenues for extending to larger models and other languages.
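The two-part reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear mixture, the default `meaning_weight` of 0.8, and the assumption that `meaning_score` is a value in [0, 1] supplied by some external meaning-preservation judge are all hypothetical choices made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def combined_reward(
    reference: str,
    hypothesis: str,
    meaning_score: float,
    meaning_weight: float = 0.8,  # assumed value, not taken from the paper
) -> float:
    """Weighted mix of a meaning-preservation score and a WER-based score.

    meaning_score is assumed to lie in [0, 1] and to come from a separate
    (hypothetical) judge of whether the transcript preserves the meaning.
    """
    if not 0.0 <= meaning_weight <= 1.0:
        raise ValueError("meaning_weight must be in [0, 1]")
    # WER can exceed 1 when the hypothesis is much longer than the
    # reference, so clip it before turning it into a reward.
    wer_reward = 1.0 - min(word_error_rate(reference, hypothesis), 1.0)
    return meaning_weight * meaning_score + (1.0 - meaning_weight) * wer_reward
```

Setting `meaning_weight` close to 1 corresponds to the setting in which meaning preservation is weighted strongly.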
Abstract
We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in an LLM's vocabulary with audio tokens and enables the model to recognize speech by fine-tuning it on speech with transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures to generalize the LLM further to recognize disordered speech. While the resulting LLM does not outperform existing systems for speech recognition, we find that tuning with reinforcement learning using custom rewards leads to substantially better performance than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting. This presents a compelling alternative tuning strategy for speech recognition using large language models.
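The vocabulary substitution described in the abstract can be illustrated with a short sketch. The function names, the use of raw frequency counts to pick the rarest text tokens, and the tie-breaking by token id are assumptions made for this example; the abstract does not specify how the low-frequency tokens are selected or how audio is discretized.

```python
def build_audio_token_map(
    text_token_counts: dict[int, int], num_audio_tokens: int
) -> dict[int, int]:
    """Assign each discrete audio code to one of the rarest text token ids.

    text_token_counts maps a text token id to its corpus frequency
    (a hypothetical input for this sketch).
    """
    if num_audio_tokens > len(text_token_counts):
        raise ValueError("not enough text tokens to repurpose")
    # Sort by frequency, breaking ties by token id so the map is deterministic.
    rare_ids = sorted(
        text_token_counts, key=lambda tok: (text_token_counts[tok], tok)
    )[:num_audio_tokens]
    return {audio_code: token_id for audio_code, token_id in enumerate(rare_ids)}


def encode_audio(audio_codes: list[int], audio_map: dict[int, int]) -> list[int]:
    """Rewrite a sequence of audio codes as ids in the LLM's existing vocabulary."""
    return [audio_map[code] for code in audio_codes]


# Toy vocabulary of five text tokens with made-up frequencies.
counts = {0: 100, 1: 2, 2: 50, 3: 1, 4: 2}
audio_map = build_audio_token_map(counts, num_audio_tokens=3)
llm_input_ids = encode_audio([2, 0, 0, 1], audio_map)
```

Because the audio codes reuse existing vocabulary slots, the embedding table and output layer keep their original size, and the model can be fine-tuned on sequences of audio ids followed by transcript ids.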
