Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning
Jing Xu, Jingzhao Zhang
TL;DR
The paper investigates the limits of parameter-efficient fine-tuning (PEFT) by introducing Random Masking, a baseline that trains only a randomly chosen subset of a pretrained model's parameters and keeps the rest frozen. It demonstrates that, with a suitably large learning rate, Random Masking can match the performance of standard PEFT methods such as LoRA while using fewer trainable parameters, across NLP and vision tasks. The authors provide empirical evidence that masking yields a flatter loss landscape and more distant solutions, and support these findings with a theoretical analysis of an overparameterized linear model, showing eigenvalue concentration and a larger stable learning rate as the mask becomes sparser. Overall, the work highlights the surprising expressiveness of pretrained models and suggests that aggressive sparsity in fine-tuning can be both effective and informative for understanding PEFT dynamics and pruning opportunities.
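The link between sparser masks and larger stable learning rates can be illustrated numerically on a toy linear model. The sketch below is an illustration under stated assumptions, not the paper's analysis: it assumes a mean-squared-error loss on Gaussian features, for which the Hessian restricted to the trainable coordinates is a principal submatrix of the full Hessian, so its top eigenvalue cannot grow as coordinates are masked out. Gradient descent on this quadratic is stable for step sizes below 2 / λ_max. The helper names (`top_eigenvalue`, `stable_lr`), the dimensions, and the keep ratios are hypothetical choices.

```python
import random


def top_eigenvalue(rows, num_iters=200, seed=0):
    """Estimate the largest eigenvalue of X^T X by power iteration (rows = X)."""
    rng = random.Random(seed)
    dim = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(num_iters):
        # u = X v, then w = X^T u, so w = (X^T X) v.
        u = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(rows[i][j] * u[i] for i in range(len(rows))) for j in range(dim)]
        eig = sum(x * x for x in w) ** 0.5  # ||A v|| for unit v, tends to lambda_max
        if eig == 0.0:
            return 0.0
        v = [x / eig for x in w]
    return eig


def stable_lr(rows, kept):
    """Largest stable gradient-descent step for MSE restricted to `kept` columns."""
    masked_rows = [[r[j] for j in kept] for r in rows]
    n = len(rows)
    # Hessian of (1/n) * ||X w - y||^2 is (2/n) X^T X; stability needs lr < 2 / lambda_max.
    return 2.0 / ((2.0 / n) * top_eigenvalue(masked_rows))


# Toy overparameterized problem (assumed sizes): 10 samples, 100 parameters.
rng = random.Random(0)
n, dim = 10, 100
X = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Nested random masks: each sparser mask is a subset of the denser one.
order = list(range(dim))
rng.shuffle(order)
stable_lrs = {
    keep_ratio: stable_lr(X, order[: int(keep_ratio * dim)])
    for keep_ratio in (1.0, 0.5, 0.1)
}
```

In this toy setting, `stable_lrs` grows as the keep ratio shrinks, which is the qualitative behavior the summary describes; the quantitative statements belong to the paper's own theory.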
Abstract
Fine-tuning large language models (LLMs) can be costly. Parameter-efficient fine-tuning (PEFT) addresses this problem by training only a fraction of the parameters, and its success reveals the expressiveness and flexibility of pretrained models. This paper studies the limits of PEFT by further simplifying its design and reducing the number of trainable parameters beyond standard setups. To this end, we use Random Masking to fine-tune the pretrained model. Despite its simplicity, we show that Random Masking is surprisingly effective: with a larger-than-expected learning rate, Random Masking can match the performance of standard PEFT algorithms such as LoRA on various tasks, using fewer trainable parameters. We provide both empirical and theoretical explorations into the success of Random Masking. We show that masking induces a flatter loss landscape and more distant solutions, which both allows for and necessitates large learning rates.
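The core idea of Random Masking, updating only a fixed random subset of the pretrained parameters, can be sketched in a few lines. This is a minimal sketch on a toy linear model in plain Python, not the authors' implementation: the function names (`random_mask`, `loss_and_grad`, `finetune_masked`), the data, the keep ratio, and the conservative step size are all assumptions made for illustration.

```python
import random


def random_mask(num_params, keep_ratio, seed=0):
    """Return a 0/1 mask with a fixed random subset of trainable coordinates."""
    rng = random.Random(seed)
    num_trainable = max(1, round(keep_ratio * num_params))
    trainable = set(rng.sample(range(num_params), num_trainable))
    return [1.0 if i in trainable else 0.0 for i in range(num_params)]


def loss_and_grad(weights, xs, ys):
    """Mean squared error of the linear model y = <w, x>, and its gradient."""
    n = len(xs)
    loss = 0.0
    grad = [0.0] * len(weights)
    for x, y in zip(xs, ys):
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        loss += err * err / n
        for j, xi in enumerate(x):
            grad[j] += 2.0 * err * xi / n
    return loss, grad


def finetune_masked(pretrained, xs, ys, mask, lr, steps):
    """Gradient descent that only moves coordinates where mask == 1."""
    weights = list(pretrained)
    for _ in range(steps):
        _, grad = loss_and_grad(weights, xs, ys)
        # Masked coordinates receive a zero update and keep their pretrained value.
        weights = [w - lr * m * g for w, m, g in zip(weights, mask, grad)]
    return weights


# Toy "pretrained model" and fine-tuning data (assumed sizes).
rng = random.Random(0)
dim, n = 50, 5
xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
pretrained = [rng.gauss(0.0, 0.1) for _ in range(dim)]
mask = random_mask(dim, keep_ratio=0.2, seed=1)

# Conservative step size: 1 / trace of the masked Hessian, which is at most 1 / lambda_max.
trace = (2.0 / n) * sum(m * xi * xi for x in xs for m, xi in zip(mask, x))
lr = 1.0 / trace

finetuned = finetune_masked(pretrained, xs, ys, mask, lr, steps=200)
loss_before, _ = loss_and_grad(pretrained, xs, ys)
loss_after, _ = loss_and_grad(finetuned, xs, ys)
```

The step size here is deliberately safe so that the loss decreases monotonically; it does not reproduce the large learning rates the abstract reports as necessary. In a deep-learning framework, one common way to get the same effect is to multiply each parameter's gradient by a fixed binary mask before the optimizer step.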
