Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning

Jing Xu, Jingzhao Zhang

TL;DR

The paper investigates the limits of parameter-efficient fine-tuning by introducing Random Masking, a baseline that trains only randomly unmasked parameters in a pretrained model. It demonstrates that, with a carefully chosen learning rate, Random Masking can match the performance of standard PEFT methods like LoRA while using far fewer trainable parameters, across NLP and vision tasks. The authors provide empirical evidence of a flatter loss landscape and more distant optimization trajectories under masking, and support these findings with theoretical analysis of an overparameterized linear model showing eigenvalue concentration and enlarged stable learning rates as masking becomes sparser. Overall, the work highlights the surprising expressiveness of pretrained models and suggests that aggressive sparsity in fine-tuning can be both effective and informative for understanding PEFT dynamics and pruning opportunities.
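The core idea, training only a random subset of the pretrained parameters, can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the linear regression task, the Bernoulli mask, and the names `random_mask` and `masked_gd_step` are all assumptions made for the example.

```python
import numpy as np

def random_mask(shape, ratio, rng):
    """Boolean mask: each entry is trainable with probability `ratio` (assumed Bernoulli sampling)."""
    return rng.random(shape) < ratio

def masked_gd_step(w, grad, mask, lr):
    """Gradient step that updates only trainable entries; frozen entries keep their values."""
    return w - lr * grad * mask

rng = np.random.default_rng(0)
n, d = 32, 256                        # overparameterized toy problem: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w0 = 0.01 * rng.normal(size=d)        # stand-in for pretrained weights
mask = random_mask(w0.shape, ratio=0.25, rng=rng)

def loss_fn(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

w = w0.copy()
for _ in range(500):
    grad = X.T @ (X @ w - y) / n      # gradient of the squared loss
    w = masked_gd_step(w, grad, mask, lr=0.05)

print("trainable entries:", int(mask.sum()), "of", d)
print("loss before:", loss_fn(w0), "loss after:", loss_fn(w))
```

In a real fine-tuning setup the same effect would typically be obtained by zeroing the gradients of frozen entries before the optimizer step. The learning rate and mask ratio here are arbitrary values for the toy problem.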

Abstract

Fine-tuning large language models (LLMs) can be costly. Parameter-efficient fine-tuning (PEFT) addresses this problem by training only a fraction of the parameters, and its success reveals the expressiveness and flexibility of pretrained models. This paper studies the limit of PEFT by further simplifying its design and reducing the number of trainable parameters beyond standard setups. To this end, we use Random Masking to fine-tune the pretrained model. Despite its simplicity, we show that Random Masking is surprisingly effective: with a larger-than-expected learning rate, Random Masking can match the performance of standard PEFT algorithms such as LoRA on various tasks, using fewer trainable parameters. We provide both empirical and theoretical explorations into the success of Random Masking. We show that masking induces a flatter loss landscape and more distant solutions, which both allows for and necessitates large learning rates.
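The link between sparser masks, a flatter landscape, and larger usable learning rates can be checked numerically on an overparameterized linear model. The sketch below is an illustration under assumptions, not the paper's analysis: the squared loss, the $1/n$ normalization, the nested masks, and the helper `top_eig` are choices made here. It relies on the standard fact that gradient descent on a quadratic is stable only for step sizes below $2/\lambda_{\max}$ of the Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 20, 2000, 1.0
X = rng.uniform(0.0, r, size=(n, d))   # each entry of X lies in [0, r]

# Nested masks (an assumption for a clean comparison): a sparser mask
# keeps a subset of the columns kept by a denser one.
perm = rng.permutation(d)

def top_eig(ratio):
    """Largest Hessian eigenvalue of the squared loss restricted to trainable coordinates."""
    k = max(1, int(ratio * d))
    Xm = X[:, perm[:k]]
    # Nonzero eigenvalues of Xm.T @ Xm / n equal those of the smaller n x n matrix below.
    return np.linalg.eigvalsh(Xm @ Xm.T / n)[-1]

ratios = [1.0, 0.1, 0.01]
lams = [top_eig(p) for p in ratios]
max_lrs = [2.0 / lam for lam in lams]  # gradient descent stability threshold

for p, lam, lr in zip(ratios, lams, max_lrs):
    print(f"ratio={p:<5} lambda_max={lam:10.3f} stable lr < {lr:.5f}")
```

With nested masks the largest eigenvalue cannot increase as the mask gets sparser, because removing columns subtracts a positive semidefinite term from `Xm @ Xm.T`. The stability threshold therefore grows as fewer parameters are trained.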

Paper Structure (36 sections, 4 theorems, 22 equations, 11 figures, 9 tables)

Key Result

Theorem 5.1

Suppose that each entry of $\boldsymbol{X}$ is in $[0,r]$. Then for any $0<\delta<1$, with probability at least $1-\delta$, the following inequality for $\lambda_i$ holds for any $i$,

Figures (11)

  • Figure 1: The average performance of PEFT methods with various numbers of trainable parameters. Masking stands for our Random Masking method; FT stands for full parameter fine-tuning; Prefix stands for Prefix-Tuning. The metrics are calculated on 11 datasets using OPT-1.3b. Despite its simple design, Random Masking achieves competitive performance with fewer trainable parameters.
  • Figure 2: Illustration of the masking methods. The red grids indicate trainable parameters and the blue grids indicate frozen parameters. (a) Full parameter fine-tuning of $\boldsymbol{W}$. (b) Random Masking of $\boldsymbol{W}$, which is the main PEFT algorithm in this paper. (c) Implementation of Random Masking of $\boldsymbol{W}$ via a sparse matrix $\boldsymbol{S}$ that is stored compactly as vectors. (d) Structured Masking of $\boldsymbol{W}$, used for the ablation studies in Section \ref{sec:ablations}.
  • Figure 3: The accuracy of Random Masking on the SST-2 dataset with different learning rates. The accuracy remains steady despite the small number of trainable parameters, as long as an appropriate learning rate is used. As the trainable parameter ratio becomes smaller, the optimal learning rate becomes larger. The complete results on the SuperGLUE benchmark are given in Figures \ref{fig:lrapx1}, \ref{fig:lrapx2}, and \ref{fig:lrapx3}.
  • Figure 4: Investigations into the training mechanism behind Random Masking. (a) A smaller trainable parameter ratio induces a smaller Hessian $\ell_2$ norm. (b) More training steps compensate for small learning rates. (c) A smaller trainable parameter ratio gives more distant solutions. These figures present the results on the SST-2 dataset. Additional results on other datasets can be found in Figures \ref{fig:small_norm_apx}, \ref{fig:more_steps_apx}, and \ref{fig:distance_apx}.
  • Figure 5: Random Masking vs. Structured Masking. Structured Masking performs worse, and its performance decays faster. The complete results for Structured Masking can be found in Table \ref{tab:structured_masking_apx} and Table \ref{tab:lr_structured_masking}.
  • ...and 6 more figures
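
The compact storage mentioned in the caption of Figure 2(c) can be sketched as follows. The exact layout used by the authors is not given here, so the index-plus-value vectors, the fixed number of trainable entries, and the helper `effective_weight` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, ratio = 64, 128, 0.01
W = rng.normal(size=(out_dim, in_dim))        # frozen pretrained weight

# Sparse update S stored as vectors: positions of trainable entries plus their values.
k = max(1, int(ratio * W.size))
flat_idx = rng.choice(W.size, size=k, replace=False)
rows, cols = np.unravel_index(flat_idx, W.shape)
values = np.zeros(k)                          # zero at initialization, so W + S equals W

def effective_weight(W, rows, cols, values):
    """Materialize W + S, where S is nonzero only at the (rows, cols) positions."""
    W_eff = W.copy()
    W_eff[rows, cols] += values
    return W_eff

print("dense entries:", W.size, "| trainable values stored:", k)
```

Only `values` would receive gradients in this layout, while `W`, `rows`, and `cols` stay fixed, so optimizer state scales with `k` rather than with the size of `W`.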

Theorems & Definitions (8)

  • Theorem 5.1
  • Proposition 5.2
  • Proposition 5.3
  • proof
  • proof
  • Lemma 1.1
  • proof
  • proof