Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations
Salah Zaiem, Titouan Parcollet, Slim Essid
TL;DR
Fine-tuning speech self-supervised representations often causes the model to forget pretraining knowledge, which harms generalization. The paper evaluates continual-learning-inspired fine-tuning strategies, including freezing-based (LoRA, adapters, EWC) and replay-based (LS-Replay, Auto-Replay) methods, on English and Danish ASR with Data2Vec Base and XLSR-53 backbones. The results show substantial in-domain and out-of-domain improvements over full fine-tuning, with relative WER gains of up to 22.5% for Danish and consistent gains across both languages, linked to reduced forgetting of the SSL task. The work highlights the practical value of preserving pretraining knowledge during fine-tuning for robust, data-efficient ASR, and releases code to support reproduction.
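As a rough illustration of how one of the methods named above limits forgetting, here is a minimal sketch of the standard EWC (elastic weight consolidation) penalty: a quadratic term that pulls each parameter back toward its pretrained value, weighted by an importance estimate. The function names, the flat lists of floats, and the `strength` parameter are assumptions made for illustration; this is not taken from the paper's released code, and the paper's exact formulation may differ.

```python
def ewc_penalty(params, anchor_params, fisher, strength):
    """Standard EWC penalty: (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameter values (hypothetical flat list of floats)
    anchor_params -- pretrained parameter values to stay close to
    fisher        -- per-parameter importance weights (diagonal Fisher estimate)
    strength      -- regularization weight (an assumed hyperparameter name)
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def regularized_loss(task_loss, params, anchor_params, fisher, strength):
    """Downstream task loss plus the EWC penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, strength)
```

With this form, parameters that the importance estimate marks as critical for the pretraining task are expensive to move, while unimportant ones remain free to adapt to the downstream task.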
Abstract
Despite being trained on massive and diverse datasets, speech self-supervised encoders are generally used for downstream purposes either as frozen feature extractors or as model initializers before fine-tuning. The former severely limits the exploitation of large encoders, while the latter hurts the robustness acquired during pretraining, especially in low-resource scenarios. This work explores middle-ground solutions, conjecturing that reducing the forgetting of the self-supervised task during downstream fine-tuning leads to better generalization. To test this conjecture, focusing on speech recognition, we benchmark different continual-learning approaches during fine-tuning and show that they improve both in-domain and out-of-domain generalization. With XLSR as the encoder, relative performance gains reach 15.7% and 22.5% on English and Danish speech recognition tasks, respectively. Further probing experiments show that these gains are indeed linked to less forgetting.
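For readers less familiar with how relative gains in WER are usually computed, the sketch below shows the conventional definition: the reduction in WER divided by the baseline WER. The function name and the numbers in the usage line are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_gain(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: a baseline WER of 20.0 reduced to 15.0.
example_gain = relative_wer_gain(20.0, 15.0)
```

Note that a relative gain is distinct from an absolute one: in the hypothetical example, the absolute reduction is 5 WER points, while the relative gain is measured against the baseline.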
