Sparse is Enough in Fine-tuning Pre-trained Large Language Models
Weixi Song, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
TL;DR
This work analyzes why pre-trained language models fine-tune efficiently, using a PAC-Bayesian framework that treats pre-training as a shift of the prior distribution. The shifted prior yields a smaller KL divergence between prior and posterior and therefore a tighter generalization bound. The theory is paired with empirical evidence from loss landscapes and gradient distributions, which reveals a quasi-sparse gradient structure after pre-training and a compressed search space for fine-tuning. Based on these insights, the authors introduce Sparse Increment Fine-Tuning (SIFT), a gradient-based, component-sparse method that updates only the top-$x\%$ gradient components and is implemented with memory-efficient backward hooks. Across GLUE and instruction-tuning tasks, SIFT achieves competitive performance with substantially fewer trainable parameters, and better parameter efficiency than full fine-tuning and common PEFT baselines. The approach offers a principled, scalable path to efficient fine-tuning of large language models, with practical benefits for resource-limited settings.
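To make the "top-$x\%$ gradient components" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the helper names `top_fraction_mask` and `sparse_sgd_step` are hypothetical, components are ranked by absolute gradient value (an assumption), the mask is computed once from a single gradient, the optimizer is plain SGD, and the memory-efficient backward hooks mentioned above are not reproduced.

```python
import heapq


def top_fraction_mask(grad, fraction):
    """Return a 0/1 mask that keeps the largest-magnitude `fraction` of entries.

    Hypothetical helper: ranking by absolute value is an assumption.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(grad) * fraction))
    # Indices of the k components with the largest absolute gradient.
    keep = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    mask = [0] * len(grad)
    for i in keep:
        mask[i] = 1
    return mask


def sparse_sgd_step(params, grad, mask, lr):
    """Apply a plain SGD step only where mask == 1; other entries stay frozen."""
    return [p - lr * g if m else p for p, g, m in zip(params, grad, mask)]


# Toy example: five parameters, keep the top 40% (two components).
params = [1.0, 1.0, 1.0, 1.0, 1.0]
grad = [0.01, -2.0, 0.03, 0.5, -0.02]
mask = top_fraction_mask(grad, 0.4)
updated = sparse_sgd_step(params, grad, mask, lr=0.1)
```

In this toy case only the two components with the largest gradient magnitudes (indices 1 and 3) are updated; the remaining three parameters keep their pre-trained values, which is what makes the increment sparse.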
Abstract
With the prevalence of the pre-training-fine-tuning paradigm, how to efficiently adapt pre-trained models to downstream tasks has become an intriguing issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated its effectiveness and has been widely applied, its underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of the prior distribution that leads to a tighter bound on the generalization error. We validate this shift from two perspectives: oscillations in the loss landscape and quasi-sparsity in the gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and instruction-tuning. The code is available at https://github.com/song-wx/SIFT/.
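For orientation, one standard form of the PAC-Bayesian bound is given below. It is a McAllester-style statement for a loss bounded in $[0,1]$; the paper may use a different variant, so the exact constants should be read as an assumption. Here $P$ is a prior fixed before seeing the $n$ training samples, $Q$ is any posterior over hypotheses, $L(Q)$ is the expected risk and $\hat{L}(Q)$ the empirical risk under $Q$. With probability at least $1-\delta$ over the sample, simultaneously for all $Q$:

$$
L(Q) \;\le\; \hat{L}(Q) + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{n}{\delta}}{2(n-1)}}
$$

The prior enters only through $\mathrm{KL}(Q \,\|\, P)$. If pre-training moves $P$ closer to the posterior reached by fine-tuning, this term shrinks and the bound tightens, which is the sense in which the abstract views pre-training as a shift of the prior.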
