Grow, Don't Overwrite: Fine-tuning Without Forgetting

Dyah Adila, Hanna Mazzawi, Benoit Dherin, Xavier Gonzalvo

TL;DR

A novel function-preserving expansion method that eliminates the trade-off between plasticity and stability, matching the performance of full fine-tuning on downstream tasks without any degradation of the model's original capabilities.

Abstract

Adapting pre-trained models to specialized tasks often leads to catastrophic forgetting, where new knowledge overwrites foundational capabilities. Existing methods either compromise performance on the new task or struggle to balance training stability with efficient reuse of pre-trained knowledge. We introduce a novel function-preserving expansion method that resolves this dilemma. Our technique expands model capacity by replicating pre-trained parameters within transformer submodules and applying a scaling correction that guarantees the expanded model is mathematically identical to the original at initialization, enabling stable training while exploiting existing knowledge. Empirically, our method eliminates the trade-off between plasticity and stability, matching the performance of full fine-tuning on downstream tasks without any degradation of the model's original capabilities. Furthermore, we demonstrate the modularity of our approach, showing that by selectively expanding a small subset of layers we can achieve the same performance as full fine-tuning at a fraction of the computational cost.
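
The function-preservation claim can be illustrated with a short worked identity. This is a sketch under stated assumptions, not the paper's own proof: it assumes a plain two-layer MLP without biases or gating, an elementwise activation $\sigma$, and a correction that simply halves each copy of the down-projection. The symbols $W_n^{(1)}$ (up-projection) and $W_n^{(2)}$ (down-projection) follow the Figure 1 caption; reading $n$ as a layer index and writing the grown matrices with tildes are assumptions made here.

```latex
% Original MLP in layer n (column-vector convention, no biases):
\[
  f_n(x) = W_n^{(2)} \, \sigma\!\left( W_n^{(1)} x \right)
\]
% Assumed expansion: stack two copies of the up-projection and
% place two half-scaled copies of the down-projection side by side.
\[
  \tilde{W}_n^{(1)} =
  \begin{bmatrix} W_n^{(1)} \\ W_n^{(1)} \end{bmatrix},
  \qquad
  \tilde{W}_n^{(2)} =
  \begin{bmatrix} \tfrac{1}{2} W_n^{(2)} & \tfrac{1}{2} W_n^{(2)} \end{bmatrix}
\]
% Because \sigma acts elementwise, both halves of the hidden vector are equal:
\[
  \tilde{f}_n(x)
  = \tilde{W}_n^{(2)} \, \sigma\!\left( \tilde{W}_n^{(1)} x \right)
  = \tfrac{1}{2} W_n^{(2)} \sigma\!\left( W_n^{(1)} x \right)
  + \tfrac{1}{2} W_n^{(2)} \sigma\!\left( W_n^{(1)} x \right)
  = f_n(x)
\]
```

Under these assumptions the expanded block computes the same function as the original for every input, so training starts from the pre-trained model's behavior rather than from a perturbed copy.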

Paper Structure (39 sections, 8 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: (a) We double the MLP's hidden dimension by duplicating the up-projection weights ($W_n^{(1)}$) and compensating in the down-projection layer ($W_n^{(2)}$) to preserve the original function. (b) In the G-Freeze variant, only new parameters (darker shades) are trained. In the G-Train variant, the entire up-projection matrix is trained while the down-projection matrix is frozen, as indicated by the snowflake symbol.
  • Figure 2: SFT (blue) shows severe degradation on the original domain (top plots), particularly for tasks with large domain shifts like translation and entailment. Our methods (green, orange) maintain original performance while matching or exceeding the baseline on the new fine-tuning tasks (bottom plots).
  • Figure 3: Full performance can be achieved with a fraction of the trainable parameters. Growing a targeted subset of 10 layers (green) consistently matches the performance of growing all layers (orange).
  • Figure 4: Performance scales with the number of grown layers $N$. New task performance (bottom) improves as $N$ increases, an effect most significant on the more complex MathQA task (b).
  • Figure 5: The effective rank of the weight update matrix. Brighter colors in the colorbar indicate a higher rank. The x-axis is the layer index; the y-axis is the training stage, with the earliest stages at the top and the latest at the bottom. Cognitively demanding tasks like MathQA involve high-rank weight updates in almost all layers.
  • ...and 3 more figures
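
The expansion described in the Figure 1 caption can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes an ungated, bias-free MLP with a GELU activation, and assumes the compensation is a factor of 0.5 on each copy of the down-projection. The names `mlp` and `grow_mlp` are hypothetical helpers introduced here for illustration.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU; any elementwise activation would work here.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w_up, w_down):
    # Row-vector convention: x is (batch, d_model),
    # w_up is (d_model, d_hidden), w_down is (d_hidden, d_model).
    return gelu(x @ w_up) @ w_down


def grow_mlp(w_up, w_down):
    # Hypothetical helper: double the hidden width by duplicating the
    # up-projection columns.
    w_up_grown = np.concatenate([w_up, w_up], axis=1)
    # Assumed compensation: each copy of the down-projection is scaled by 0.5,
    # so the two identical hidden halves sum back to the original output.
    w_down_grown = np.concatenate([0.5 * w_down, 0.5 * w_down], axis=0)
    return w_up_grown, w_down_grown


rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
w_up = rng.normal(size=(d_model, d_hidden))
w_down = rng.normal(size=(d_hidden, d_model))
x = rng.normal(size=(4, d_model))

w_up_grown, w_down_grown = grow_mlp(w_up, w_down)
y_original = mlp(x, w_up, w_down)
y_grown = mlp(x, w_up_grown, w_down_grown)

# The grown MLP has twice the hidden width but the same output (up to
# floating-point rounding).
max_abs_diff = float(np.max(np.abs(y_original - y_grown)))
print(w_up_grown.shape, w_down_grown.shape, max_abs_diff)
```

In this sketch the two halves of the grown up-projection start out identical. Freezing one half and training the other would loosely mirror the G-Freeze variant in the caption, while training the whole grown up-projection with the down-projection frozen would mirror G-Train; the exact training setup is not specified in this section.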

Theorems & Definitions (1)

  • Proof: Proof of Function Preservation