Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee

TL;DR

Fine-tuning speech representation models can improve task performance but often harms cross-task generalization. Speech-FT mitigates this by first performing stable fine-tuning to minimize representational drift, then merging the pre-trained and fine-tuned models via weight-space interpolation, effectively preserving general representations while injecting task-specific knowledge. Across HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+, Speech-FT consistently outperforms weight-space regularization, LoRA/DoRA, and early checkpoints, and yields notable gains on SUPERB, including cross-lingual adaptation scenarios. Theoretical insights via LLFC and Task Arithmetic explain why interpolation preserves feature geometry and enables robust cross-task generalization with practical, low-cost deployment.

Abstract

Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, which make it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model and may therefore lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model than alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT reduces phone error rate from 5.17% to 3.94%, lowers word error rate from 6.38% to 5.75%, and increases speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
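
The second stage described above, weight-space interpolation between the pre-trained and fine-tuned models, can be illustrated with a minimal sketch. It assumes a plain linear interpolation in which a coefficient `alpha` weights the fine-tuned model and `1 - alpha` weights the pre-trained one; the exact convention is not stated in this summary, so treat it as an assumption. Checkpoints are represented as dictionaries of flat float lists, a toy stand-in for real model parameters, and the function name `interpolate_weights` and the parameter names are hypothetical. The stable fine-tuning stage is not covered here.

```python
from typing import Dict, List

# Toy stand-in for a real checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def interpolate_weights(pretrained: StateDict,
                        finetuned: StateDict,
                        alpha: float) -> StateDict:
    """Return (1 - alpha) * pretrained + alpha * finetuned, per parameter.

    Assumption: alpha in [0, 1] weights the fine-tuned model, so alpha = 0
    recovers the pre-trained weights and alpha = 1 the fine-tuned ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    merged: StateDict = {}
    for name, w0 in pretrained.items():
        w1 = finetuned[name]
        if len(w0) != len(w1):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(w0, w1)]
    return merged


# Hypothetical two-parameter "models" used only for illustration.
theta_0 = {"encoder.weight": [0.0, 1.0, 2.0], "encoder.bias": [0.5]}
theta_ft = {"encoder.weight": [1.0, 1.0, 4.0], "encoder.bias": [1.5]}

merged = interpolate_weights(theta_0, theta_ft, alpha=0.25)
print(merged)
```

With a real framework the same element-wise rule would be applied to each tensor in the two checkpoints; only the representation model is merged, since the task prediction model is discarded after fine-tuning.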

Paper Structure

This paper contains 33 sections, 5 equations, 3 figures, 11 tables, and 1 algorithm.

Figures (3)

  • Figure 1: The pipeline of Speech-FT for representation learning and evaluation. Step 1: A pre-trained representation model $\theta_0$ undergoes stable fine-tuning on a specific task $\hat{t}$, producing a tuned representation model $\theta'$ while discarding the task prediction model $D$. Step 2: The pre-trained and tuned models are merged in weight space to obtain the final representation model $\hat{\theta}$. Step 3: The merged model is evaluated on the SUPERB benchmark by re-training task-specific downstream models, ensuring that cross-task generalization of $\hat{\theta}$ is measured rather than performance of the discarded $D$.
  • Figure 2: Feature similarity with the pre-trained model. (Top) Effect of $\alpha$ on the cosine similarity between Speech-FT and pre-trained features. (Bottom) Cosine similarity between the pre-trained features and those from Speech-FT, weight-space regularization ("Weight-Space Reg."), LoRA fine-tuning ("LoRA"), early checkpoint during fine-tuning ("Early Checkpoint"), and feature-space regularization ("Feature-Space Reg.").
  • Figure 3: Average L2 distortion per parameter in the weight space with respect to the pre-trained model. "Weight-Space Reg." denotes weight-space regularization.
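
The two quantities plotted in Figures 2 and 3 can be sketched in a few lines. This is an illustrative reading, not the paper's evaluation code: feature similarity is taken here as the cosine similarity between corresponding frame-level feature vectors, averaged over frames, and "average L2 distortion per parameter" is assumed to mean the L2 norm of the weight difference divided by the number of parameters. Both definitions, and all function names, are assumptions made for this sketch.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_frame_similarity(feats_a: Sequence[Sequence[float]],
                          feats_b: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over aligned frames of two feature sequences."""
    if len(feats_a) != len(feats_b) or not feats_a:
        raise ValueError("need two non-empty sequences with the same length")
    sims = [cosine_similarity(a, b) for a, b in zip(feats_a, feats_b)]
    return sum(sims) / len(sims)


def l2_distortion_per_parameter(theta_0: Dict[str, List[float]],
                                theta: Dict[str, List[float]]) -> float:
    """Assumed definition: ||theta - theta_0||_2 divided by the parameter count."""
    squared_sum = 0.0
    count = 0
    for name, w0 in theta_0.items():
        w = theta[name]
        if len(w0) != len(w):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        squared_sum += sum((a - b) ** 2 for a, b in zip(w0, w))
        count += len(w0)
    return math.sqrt(squared_sum) / count


# Hypothetical toy inputs: two frames of 2-dimensional features.
pretrained_feats = [[1.0, 0.0], [1.0, 0.0]]
merged_feats = [[1.0, 0.0], [0.0, 1.0]]
similarity = mean_frame_similarity(pretrained_feats, merged_feats)

distortion = l2_distortion_per_parameter({"w": [0.0, 0.0, 0.0, 0.0]},
                                         {"w": [3.0, 4.0, 0.0, 0.0]})
print(similarity, distortion)
```

Computing both numbers for the same model makes the abstract's observation concrete: a method can sit farther from the pre-trained weights while its features remain closer to the pre-trained ones.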