Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations

Salah Zaiem, Titouan Parcollet, Slim Essid

TL;DR

Fine-tuning speech self-supervised representations often causes the model to forget pretraining knowledge, which harms generalization. The paper evaluates continual-learning-inspired fine-tuning strategies, including freezing-based (LoRa, adapters, EWC) and replay-based (LS-Replay, Auto-Replay) methods, on English and Danish ASR with Data2Vec Base and XLSR-53 backbones. The results show substantial in-domain and out-of-domain improvements over full fine-tuning, with relative WER gains of up to 22.5% on Danish and consistent gains across both languages, linked to reduced forgetting of the SSL task. The work highlights the practical value of preserving pretraining knowledge during fine-tuning for robust, data-efficient ASR, and releases code to support reproduction.

Abstract

Despite being trained on massive and diverse datasets, speech self-supervised encoders are generally used for downstream purposes as mere frozen feature extractors or model initializers before fine-tuning. The former severely limits the exploitation of large encoders, while the latter hurts the robustness acquired during pretraining, especially in low-resource scenarios. This work explores middle-ground solutions, conjecturing that reducing the forgetting of the self-supervised task during the downstream fine-tuning leads to better generalization. To prove this, focusing on speech recognition, we benchmark different continual-learning approaches during fine-tuning and show that they improve both in-domain and out-of-domain generalization abilities. Relative performance gains reach 15.7% and 22.5% with XLSR used as the encoder on two English and Danish speech recognition tasks. Further probing experiments show that these gains are indeed linked to less forgetting.
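
The relative gains quoted above can be read as relative word error rate (WER) reductions with respect to the full fine-tuning baseline. The following is a minimal sketch of that arithmetic, assuming the usual definition (baseline minus new, divided by baseline); the function name `relative_wer_reduction` and the WER values are hypothetical illustrations, not numbers or code from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Assumed definition: (baseline - new) / baseline * 100.
    A positive value means the new system makes fewer errors.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


if __name__ == "__main__":
    # Hypothetical WERs (in percent), NOT results from the paper.
    baseline = 20.0  # full fine-tuning
    improved = 18.0  # a continual-learning fine-tuning variant
    gain = relative_wer_reduction(baseline, improved)
    print(f"{gain:.1f}% relative WER reduction")
```

Under this definition, a drop of 2 absolute WER points from a baseline of 20 corresponds to a 10% relative gain, so the same absolute improvement counts for more when the baseline WER is lower.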

Paper Structure (15 sections, 2 equations, 2 figures, 1 table)

This paper contains 15 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Effect of different hyper-parameters on the final performance on Danish in-domain (ID, left y-axis) and out-of-domain (OOD, right y-axis) test sets, for three different techniques (LoRa, EWC, and LS-Replay), with XLSR backbone. While LoRa seems quite robust to changes in the main hyperparameter, always remaining under the baseline, other approaches require careful tuning. In the second and third plots, the fine-tuning baseline is shown for $x=0$, while it is shown with horizontal dashed lines for the LoRa plot.
  • Figure 2: Evolution of the self-supervision task loss for the 4 considered techniques on two English test sets with the Data2Vec backbone. The approaches that perform best on the ASR task are also those that perform best on the SSL task after fine-tuning.