Robust fine-tuning of speech recognition models via model merging: application to disordered speech

Alexandre Ducorroy, Rachid Riad

TL;DR

This study investigates model merging as a robust adaptation method for dysarthric ASR, using Whisper as the base speech foundation model. By merging fine-tuned checkpoints with three strategies (MAST, MAcT, and SMAcT), the authors demonstrate consistent WER improvements over standard fine-tuning, with SMAcT achieving the best results (WER 12.38, SEM 84.34) on the SAP challenge test sets. The gains are most pronounced for long utterances, with a 16.2% relative WER reduction, and persist in low-data regimes and across different Whisper architectures. The approach adds no inference cost and requires no additional hyperparameter tuning, suggesting practical utility for inclusive ASR in disordered speech contexts.

Abstract

Automatic Speech Recognition (ASR) has advanced with Speech Foundation Models (SFMs), yet performance degrades on dysarthric speech due to variability and limited data. This study, conducted as part of a submission to the Speech Accessibility challenge, explores model merging to improve ASR generalization using Whisper as the base SFM. We compared standard fine-tuning with two merging strategies: single-trajectory merging, which combines models from one fine-tuning path, and multi-run merging, which combines independently trained models. Our best multi-run merging approach achieved a 12% relative decrease in WER over classic fine-tuning, and a 16.2% relative decrease on long-form audio, a major contributor to errors in dysarthric ASR. Merging additional models yielded continued gains, remained effective in low-data regimes, and generalized across model architectures. These results highlight model merging as an easily replicable adaptation method that consistently improves ASR without additional inference cost or hyperparameter tuning.
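To make the idea of merging concrete, here is a minimal sketch that averages the parameters of several checkpoints element-wise. It assumes uniform weight averaging, which is only one common form of model merging; this page does not spell out how the paper's strategies are defined, so the sketch should not be read as the authors' procedure. The `Params` type and `merge_uniform` function are hypothetical names, and plain lists of floats stand in for real Whisper weight tensors.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Params = Dict[str, List[float]]


def merge_uniform(checkpoints: Sequence[Params]) -> Params:
    """Return the element-wise mean of every parameter across checkpoints.

    Assumption: all checkpoints come from the same architecture, so they
    share parameter names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share parameter names")

    n = len(checkpoints)
    merged: Params = {}
    for name in names:
        # zip(*...) groups the i-th weight of every checkpoint together.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(column) / n for column in columns]
    return merged


# Toy example: two checkpoints, e.g. saved at different steps of one run
# (single-trajectory) or at the end of two independent runs (multi-run).
toy_checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
merged = merge_uniform(toy_checkpoints)
print(merged)  # {'w': [2.0, 3.0], 'b': [0.5]}
```

The merged result is a single set of weights with the same shape as each input checkpoint, which is why merging of this kind adds no cost at inference time: only one model is ever run.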

Paper Structure

This paper contains 7 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: WER comparison between classical fine-tuning and merging strategies for Whisper, evaluated on a subset of the SAP development set. We used 30 different checkpoints for each strategy, and WER is reported for different audio lengths.
  • Figure 2: Evolution of WER when merging models, compared to single-model evaluations (black crosses). The upper panel shows the WER progression when merging models along a single optimization trajectory, while the lower panel shows WER when merging models from different trajectories. In both panels, each black cross marks the WER of an individual model used in the merging process.
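A curve like the one in Figure 2 needs one merged model for each number of checkpoints included. The sketch below shows one way this could be done, assuming a uniform running mean over checkpoints; whether the paper computes its merges this way is not stated on this page. The `running_merges` generator is a hypothetical helper, and lists of floats again stand in for weight tensors.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Params = Dict[str, List[float]]


def running_merges(checkpoints: Iterable[Params]) -> Iterator[Params]:
    """Yield the uniform average of the first k checkpoints, for k = 1, 2, ...

    Uses the incremental update mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k,
    so only one running copy of the weights is kept in memory.
    """
    mean: Params = {}
    for k, ckpt in enumerate(checkpoints, start=1):
        if k == 1:
            mean = {name: list(values) for name, values in ckpt.items()}
        else:
            for name, values in ckpt.items():
                mean[name] = [m + (v - m) / k for m, v in zip(mean[name], values)]
        # Yield a copy so later updates do not mutate earlier snapshots.
        yield {name: list(values) for name, values in mean.items()}


# Toy example: each snapshot would be evaluated separately to get one WER point.
snapshots = list(running_merges([{"w": [0.0]}, {"w": [2.0]}, {"w": [4.0]}]))
print(snapshots)  # [{'w': [0.0]}, {'w': [1.0]}, {'w': [2.0]}]
```

Each yielded snapshot could then be loaded into the model and scored, giving one point on the curve per value of k.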