Random Initialization of Gated Sparse Adapters

Vi Retault, Yohaï-Eliel Berreby

TL;DR

RIGSA is a memory-efficient fine-tuning approach that starts from randomly initialized full-rank adapters and combines a gating mechanism with iterative magnitude pruning to produce sparse, trainable updates for a frozen base model. By optimizing $W = W_0 + \alpha \Delta W$ with a near-zero gate $\alpha$, the method stabilizes early training while still allowing substantial adaptation; iterative pruning then yields a sparse $\Delta W_T$, which is followed by a final retraining pass. In experiments with SmolLM2-1.7B-Instruct on a novel Textual MNIST task, RIGSA learns the target task and exhibits less forgetting on source tasks than QLoRA, though it does not consistently outperform random masking or reach the target-task accuracy of the best sparse baselines. The work highlights potential regularization benefits of sparse adaptation and motivates broader, repeatable comparisons and hyperparameter sweeps to better understand the trade-offs between target-task performance and forgetting in foundation-model fine-tuning.

Abstract

When fine-tuning language models on new tasks, catastrophic forgetting -- performance degradation on previously-learned tasks -- is a ubiquitous problem. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA address this through low-rank adapters, sparse adaptation offers an alternative that doesn't impose rank constraints. We introduce Random Initialization of Gated Sparse Adapters (RIGSA), which starts from randomly-initialized full-rank adapters, gates them with a ReZero analog, and sparsifies them with iterative magnitude pruning. We evaluate RIGSA on SmolLM2-1.7B-Instruct using a novel vision-in-text task (Textual MNIST) and measure forgetting on PIQA, HellaSwag, and GSM8k. SmolLM2-1.7B-Instruct initially performs around chance level on Textual MNIST, and is capable of learning the task through RIGSA, 4-bit QLoRA and random masking. In spite of having more trainable parameters than QLoRA, the RIGSA configurations that we studied displayed less forgetting than QLoRA, particularly on GSM8k, though they performed comparably to random masking.
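
The gating and pruning steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `effective_weight` and `magnitude_prune`, the flat-list weight layout, the gate initialized at exactly zero, and the 50% keep fraction per round are all assumptions chosen for illustration, and the training step between pruning rounds is omitted.

```python
import random


def effective_weight(w0, delta, alpha, mask):
    """Return W = W0 + alpha * (mask * delta), element-wise on flat lists."""
    return [w + alpha * m * d for w, d, m in zip(w0, delta, mask)]


def magnitude_prune(delta, mask, keep_fraction):
    """Keep the largest-|delta| entries among those that are still unmasked."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    n_keep = max(1, int(len(alive) * keep_fraction))
    ranked = sorted(alive, key=lambda i: abs(delta[i]), reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(len(mask))]


rng = random.Random(0)
w0 = [rng.gauss(0.0, 1.0) for _ in range(16)]     # frozen base weights
delta = [rng.gauss(0.0, 1.0) for _ in range(16)]  # randomly initialized adapter
alpha = 0.0                                       # gate (assumed to start at zero)
mask = [1] * 16                                   # every adapter entry trainable

# With a zero gate, the adapted weights coincide with the base weights,
# so the random adapter cannot perturb the model at the start of training.
start = effective_weight(w0, delta, alpha, mask)

for _ in range(3):
    # Training of delta and alpha would happen here (omitted in this sketch).
    mask = magnitude_prune(delta, mask, keep_fraction=0.5)

remaining = sum(mask)  # 16 -> 8 -> 4 -> 2 trainable entries
```

The sketch shows why the gate matters for a randomly initialized adapter: without it, a random $\Delta W$ would change the base model's outputs before any training has taken place.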

Paper Structure

This paper contains 17 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An MNIST image of the digit 6 (left), and its Textual MNIST representation (right). 0-255 pixel values are quantized to the 0-9 range, so that each pixel maps to a single ASCII digit.
  • Figure 2: Test accuracy on source and target tasks at each pruning iteration, for our method ("RIGSA"), a random mask with a parameter budget equal to the one available at the last pruning iteration ("Random Mask"), and the model before any fine-tuning ("Baseline"). a.: On the target task (Textual MNIST), more pruning iterations (and thus fewer trainable parameters) lead to an accuracy drop, with accuracy ultimately reverting to near the random-mask performance. b-d.: Fine-tuning with a random mask appears to improve performance on some source tasks, though this may be due to inter-run variability rather than a genuine effect, as one would typically expect performance to degrade (or remain unchanged) on source tasks when fine-tuning on a new target task.
  • Figure 3: QLoRA experiments, with unsloth/SmolLM2-1.7B-Instruct-bnb-4bit as the base model. We report the test accuracy as a function of the QLoRA rank on the target (a.) and source tasks (b.-d.). Note that the baseline performance is lower than in \ref{fig:smollm2_forgetting_step} due to the 4-bit quantization of the base model. a.: Although the original model performs around chance level (\ref{tab:smollm2_mnist_baseline}), our Textual MNIST task is learnable across all tested QLoRA ranks (1-16). b.-d.: Performance on source tasks appears to increase with the QLoRA rank. A higher rank maps to a higher trainable parameter count; thus, it could be expected to lead to a more significant deviation from the base model, more pronounced overfitting to the Textual MNIST task, and thus worse performance degradation on the source tasks. However, this is not what we observe. A low-rank $\Delta W$ might be counter-productive: while it enforces a lower number of trainable parameters, it does so in a rigid way. In contrast, a higher-rank $\Delta W$ might more readily express a more "natural", less ad-hoc adaptation of the base model's weights. This effect, combined with the regularization induced by weight decay, might lead to the enhanced preservation of source-task performance with increasing rank that we observe.
  • Figure 4: Test accuracy on Textual MNIST vs. number of trainable parameters across adapters. Test accuracy for QLoRA remains relatively stable across trainable parameter budgets. In contrast, RIGSA's performance increases with the parameter budget.
  • Figure 5: Textual MNIST classification prompt.
  • ...and 1 more figures
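
The Textual MNIST encoding described in the Figure 1 caption can be sketched as follows. The caption only states that 0-255 pixel values are quantized to the 0-9 range; the exact rounding rule (`p * 10 // 256` here), the newline-separated row layout, and the function name `to_textual` are assumptions made for illustration and may differ from the paper's preprocessing.

```python
def to_textual(image):
    """Map rows of 0-255 pixel values to lines of single ASCII digits (0-9)."""
    return "\n".join(
        "".join(str(p * 10 // 256) for p in row)  # assumed quantization rule
        for row in image
    )


# A tiny 2x3 "image" standing in for a 28x28 MNIST digit.
tiny = [
    [0, 128, 255],
    [25, 26, 200],
]
text = to_textual(tiny)
print(text)
# 059
# 017
```

Under this assumed rule, each pixel becomes exactly one character, so a 28x28 image would become 28 lines of 28 digits that can be placed directly in a text prompt.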