Robust fine-tuning of speech recognition models via model merging: application to disordered speech
Alexandre Ducorroy, Rachid Riad
TL;DR
This study investigates model merging as a robust adaptation method for dysarthric ASR, using Whisper as the base speech foundation model. By aggregating multiple fine-tuning trajectories through MAST, MAcT, and SMAcT, the authors demonstrate consistent WER improvements over standard fine-tuning, with SMAcT achieving the best results (WER 12.38, SEM 84.34) on the SAP challenge test sets. The gains are especially pronounced for long utterances, with a 16.2% relative improvement, and they persist in low-data regimes and across different Whisper architectures. The approach requires no extra inference cost or hyperparameter tuning and generalizes across model sizes, suggesting practical utility for inclusive ASR in disordered speech contexts.
Abstract
Automatic Speech Recognition (ASR) has advanced with Speech Foundation Models (SFMs), yet performance degrades on dysarthric speech due to variability and limited data. This study, conducted as part of a submission to the Speech Accessibility challenge, explored model merging to improve ASR generalization, using Whisper as the base SFM. We compared standard fine-tuning with single-trajectory merging, which combines models from one fine-tuning path, and multi-run merging, which combines independently trained models. Our best multi-run merging approach achieved a 12% relative reduction in WER over classic fine-tuning, and a 16.2% relative reduction on long-form audio, a major source of error in dysarthric ASR. Merging an increasing number of models yielded continued gains, and the approach remained effective in low-data regimes and generalized across model architectures. These results highlight model merging as an easily replicable adaptation method that consistently improves ASR without additional inference cost or hyperparameter tuning.
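As an illustration of why merging adds no inference cost, the sketch below averages the parameters of several fine-tuned models into a single set of weights of the same shape. The abstract does not specify the merging operator, so uniform averaging is an assumption here, and this is not the authors' MAST, MAcT, or SMAcT procedure. The `merge_checkpoints` helper and the toy parameter names are hypothetical, and plain Python lists stand in for framework tensors.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a framework state dict: parameter name -> flat values.
StateDict = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Sequence[StateDict]) -> StateDict:
    """Uniformly average same-named parameters across checkpoints.

    Assumption: all checkpoints come from the same architecture, so every
    parameter name and length matches the first checkpoint.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    reference = checkpoints[0]
    merged: StateDict = {}
    for name, values in reference.items():
        for ckpt in checkpoints[1:]:
            if name not in ckpt or len(ckpt[name]) != len(values):
                raise ValueError(f"checkpoint mismatch for parameter {name!r}")
        # Element-wise mean over all checkpoints for this parameter.
        merged[name] = [
            sum(ckpt[name][i] for ckpt in checkpoints) / len(checkpoints)
            for i in range(len(values))
        ]
    return merged


# Toy example: three "fine-tuned" versions of a model with two parameters.
runs = [
    {"encoder.weight": [1.0, 2.0], "decoder.bias": [0.0]},
    {"encoder.weight": [3.0, 4.0], "decoder.bias": [3.0]},
    {"encoder.weight": [5.0, 6.0], "decoder.bias": [6.0]},
]
merged = merge_checkpoints(runs)
print(merged)
```

Under this assumption, the same function would cover both settings compared in the abstract: passing models saved along one fine-tuning path corresponds to single-trajectory merging, while passing the final models of independent runs corresponds to multi-run merging. In either case the result is a single model, so decoding cost is unchanged.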
