Exploring Generative Error Correction for Dysarthric Speech Recognition
Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi
TL;DR
This work tackles dysarthric speech recognition by pairing a general ASR system with a Generative Error Correction (GER) stage driven by a large language model. The ASR system first produces an N-best list of hypotheses; a prompt-driven GER model then synthesizes a refined transcription from that list, with hypothesis diversity helping to cover the plausible interpretations. Experiments on the Speech Accessibility Project dataset show strong gains from both ASR adaptation and GER, with the best development performance achieved when the two are combined, while isolated-word recognition remains a major challenge. The findings highlight the complementary strengths of acoustic modeling and linguistic correction for dysarthric speech and point to improving robustness and isolated-word transcription as directions for future work.
Abstract
Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025 that combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating hypothesis selection to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition.
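The two-stage idea described above (an N-best list from the ASR system, followed by prompt-driven correction) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names `select_diverse`, `build_ger_prompt`, and `correct`, the duplicate-removal rule used as a stand-in for hypothesis selection, the prompt wording, and the `llm` callable, which represents any text-in, text-out language model.

```python
from typing import Callable, List


def select_diverse(hypotheses: List[str], n: int = 5) -> List[str]:
    """Keep the first n distinct hypotheses (hypothetical selection rule).

    Hypotheses are compared case-insensitively after collapsing whitespace,
    so near-identical duplicates do not crowd out other readings.
    """
    seen = set()
    kept: List[str] = []
    for hyp in hypotheses:
        key = " ".join(hyp.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(" ".join(hyp.split()))
        if len(kept) == n:
            break
    return kept


def build_ger_prompt(hypotheses: List[str]) -> str:
    """Format the N-best list as a numbered prompt (wording is an assumption)."""
    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(hypotheses, start=1))
    return (
        "The following are candidate transcriptions of one utterance, "
        "ordered from most to least likely by the speech recognizer.\n"
        f"{numbered}\n"
        "Write the single most plausible transcription:"
    )


def correct(hypotheses: List[str], llm: Callable[[str], str], n: int = 5) -> str:
    """Run the correction stage; fall back to the top hypothesis if the LLM returns nothing."""
    kept = select_diverse(hypotheses, n)
    if not kept:
        return ""
    answer = llm(build_ger_prompt(kept)).strip()
    return answer or kept[0]
```

In this sketch, `correct` would be called once per utterance with the hypotheses produced by the recognizer's beam search. Falling back to the top ASR hypothesis when the model returns an empty string is a design choice of the sketch, intended to keep the correction stage from ever producing a worse result than no transcription at all.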
