Exploring Generative Error Correction for Dysarthric Speech Recognition

Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi

TL;DR

This work tackles dysarthric speech recognition by pairing a general ASR system with a Generative Error Correction (GER) stage driven by a large language model. The approach first generates multiple ASR hypotheses (N-best), then selects a diverse subset so that the candidates cover different possible interpretations, and finally uses a prompt-driven GER model to synthesize a refined transcription. Experiments on the Speech Accessibility Project dataset show strong gains from both ASR adaptation and GER, with the best development performance achieved when the two components are combined, while isolated-word recognition remains a major challenge. The findings highlight the complementary strengths of acoustic modeling and linguistic correction for dysarthric speech and point to improving robustness and isolated-word transcription as directions for future work.

Abstract

Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition.
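To make the two-stage idea concrete, the following is a minimal sketch of how a diverse subset could be picked from an N-best list and packed into a GER prompt. It is an illustration only: this summary does not state the selection criterion or the prompt wording used in the paper, so the greedy max-min rule over word-level edit distance, the helper names (`word_edit_distance`, `select_diverse`, `build_prompt`), the default `k`, and the prompt text are all assumptions.

```python
from typing import List


def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over whitespace-separated words."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def select_diverse(nbest: List[str], k: int = 5) -> List[str]:
    """Hypothetical selection rule (not taken from the paper).

    Keeps the top-ranked hypothesis, then greedily adds the candidate whose
    minimum distance to the already chosen ones is largest.
    """
    # Normalize and drop exact duplicates while preserving the ASR ranking.
    unique = list(dict.fromkeys(h.strip().lower() for h in nbest if h.strip()))
    if not unique:
        return []
    chosen = [unique[0]]
    remaining = unique[1:]
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda h: min(word_edit_distance(h, c) for c in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


def build_prompt(hypotheses: List[str]) -> str:
    """Assumed prompt layout; the actual GER prompt is shown in the paper's figure."""
    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(hypotheses, start=1))
    return (
        "The following are candidate transcriptions of one utterance.\n"
        f"{numbered}\n"
        "Return the single most plausible transcription."
    )


if __name__ == "__main__":
    nbest = [
        "i want to go home",
        "i want to go home",
        "i want two go home",
        "i won't to go hum",
        "a want to go home",
    ]
    print(build_prompt(select_diverse(nbest, k=3)))
```

The string returned by `build_prompt` would then be passed to whichever LLM serves as the GER model. Keeping the top-ranked hypothesis first preserves the ASR system's own preference, while the max-min step discourages near-identical candidates from crowding the prompt.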

Paper Structure

This paper contains 15 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of our two-stage framework for dysarthric speech recognition. Stage 1 uses the ASR model to generate 20-best hypotheses from the input audio. Stage 2 selects diverse hypotheses and employs the GER model to analyze them collectively, producing a refined final transcription.
  • Figure 1: Performance comparison across ASR and GER configurations.
  • Figure 2: GER model prompt.