
Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning

Chirag Nagpal, Subhashini Venugopalan, Jimmy Tobin, Marilyn Ladewig, Katherine Heller, Katrin Tomanek

TL;DR

This work demonstrates how to repurpose a decoder-only LLM for automatic speech recognition by discretizing audio into tokens that substitute rare text tokens in the vocabulary. The model is first fine-tuned on mixtures of standard and disordered speech, then further adapted with reinforcement learning using rewards for semantic meaning preservation and syntactic accuracy (WER). The RLHF approach yields substantial improvements over supervised fine-tuning in adapting to disordered speech, though it does not surpass specialized ASR systems; the best results occur when meaning preservation is weighted strongly. The study validates an effective, scalable strategy for domain adaptation of LLMs to speech by combining audio-token discretization with multi-objective RLHF, and highlights avenues for extending to larger models and other languages.

Abstract

We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in an LLM's vocabulary with audio tokens and enables the model to recognize speech by fine-tuning it on speech with transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures, generalizing the LLM further to recognize disordered speech. While the resulting LLM does not outperform existing systems for speech recognition, we find that tuning with reinforcement learning using custom rewards leads to substantially better performance than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting. This presents a compelling alternative tuning strategy for speech recognition using large language models.
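
The vocabulary substitution described in the abstract can be illustrated with a minimal sketch. It assumes that audio has already been discretized into integer cluster IDs and that per-token frequency counts for the text vocabulary are available; the function names `build_audio_token_map` and `encode_audio` are hypothetical and are not taken from the paper.

```python
def build_audio_token_map(token_counts, num_audio_clusters):
    """Map each audio cluster ID to one of the least frequent text token IDs.

    token_counts: dict of text token ID -> corpus frequency (assumed input).
    num_audio_clusters: number of discrete audio units to accommodate.
    """
    if num_audio_clusters > len(token_counts):
        raise ValueError("more audio clusters than vocabulary entries")
    # Sort by ascending frequency; break ties by token ID so the map is deterministic.
    rare_ids = sorted(token_counts, key=lambda t: (token_counts[t], t))
    rare_ids = rare_ids[:num_audio_clusters]
    return {cluster: token_id for cluster, token_id in enumerate(rare_ids)}


def encode_audio(cluster_ids, audio_map):
    """Rewrite a sequence of audio cluster IDs as reused vocabulary token IDs."""
    return [audio_map[c] for c in cluster_ids]


# Toy vocabulary of five tokens; tokens 3 and 1 are the rarest.
counts = {0: 100, 1: 2, 2: 50, 3: 1, 4: 7}
audio_map = build_audio_token_map(counts, num_audio_clusters=2)
audio_tokens = encode_audio([1, 0, 0], audio_map)
```

Reusing existing token IDs keeps the embedding table and output layer at their original size, which is one plausible motivation for substituting rare tokens rather than extending the vocabulary.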

Paper Structure (13 sections, 1 equation, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Using an LLM for ASR: Our approach first clusters the audio embeddings and maps them into the LLM's vocabulary space. This is followed by supervised fine-tuning on a data mixture containing disordered speech. In the final step, we use reinforcement learning to improve the model's output in terms of Word Error Rate and human-assessed meaning preservation quality.
  • Figure 2: Cross-entropy loss on a held-out validation set vs. learning steps for the Gemma 2B base LLM, under different training mixtures of LibriSpeech and Euphonia data. Augmenting the disordered speech training data with generic ASR data from LibriSpeech improves generalization.
  • Figure 3: A larger $\gamma$ improves meaning preservation on the Euphonia validation and test sets, across all severity levels.
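
The trade-off between Word Error Rate and meaning preservation mentioned in the captions can be sketched as a scalar reward. This is only an illustration: it assumes the two terms are combined linearly with $\gamma$ weighting the meaning-preservation term, which may differ from the paper's actual equation (not reproduced in this section). The `meaning_score` argument is a hypothetical value in [0, 1] supplied by some external judge.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def combined_reward(reference, hypothesis, meaning_score, gamma):
    """Assumed linear reward: penalize WER, reward meaning preservation scaled by gamma."""
    return -word_error_rate(reference, hypothesis) + gamma * meaning_score


wer = word_error_rate("turn on the light", "turn on light")
reward = combined_reward("turn on the light", "turn on light", meaning_score=1.0, gamma=2.0)
```

Under this assumed form, raising `gamma` makes a transcript that drops a function word but keeps the intent score higher relative to one that is closer word for word but changes the meaning.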