Sparse is Enough in Fine-tuning Pre-trained Large Language Models

Weixi Song; Zuchao Li; Lefei Zhang; Hai Zhao; Bo Du

TL;DR

This work analyzes why pre-trained language models fine-tune efficiently by applying a PAC-Bayesian framework that treats pre-training as a shift of the prior, yielding tighter generalization bounds through a smaller KL divergence between prior and posterior. It couples this theory with empirical evidence from loss landscapes and gradient distributions, revealing a quasi-sparse gradient structure after pre-training and a compressed search space for fine-tuning. Based on these insights, the authors introduce Sparse Increment Fine-Tuning (SIFT), a gradient-based, component-sparse method that updates only the top-$x\%$ gradient components, implemented with memory-efficient backward hooks. Across GLUE and instruction-tuning tasks, SIFT demonstrates competitive performance with substantially fewer trainable parameters and improved parameter efficiency compared with full fine-tuning and common PEFT baselines. The approach offers a principled, scalable path to efficient fine-tuning of large language models, with practical benefits for resource-limited settings.
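To make the "quasi-sparse gradient" idea concrete, the sketch below measures how much of a gradient's L2 norm is carried by its top 1% of components. It is a minimal illustration only: the toy gradients, the helper names (`top_fraction_indices`, `norm_share`, `make_gradient`), and the choice of ranking components by absolute value are assumptions made here, not code or data from the paper.

```python
import math
import random


def top_fraction_indices(grad, fraction):
    """Indices of the largest-magnitude components, covering `fraction` of the vector."""
    k = max(1, math.ceil(fraction * len(grad)))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return order[:k]


def norm_share(grad, indices):
    """Share of the full L2 norm carried by the selected components."""
    total = math.sqrt(sum(g * g for g in grad))
    kept = math.sqrt(sum(grad[i] ** 2 for i in indices))
    return kept / total if total > 0 else 0.0


def make_gradient(n, n_large, rng):
    """Toy quasi-sparse gradient: mostly tiny values plus a few large ones (assumed shape)."""
    grad = [rng.gauss(0.0, 1e-3) for _ in range(n)]
    for i in rng.sample(range(n), n_large):
        grad[i] = rng.gauss(0.0, 1.0)
    return grad


rng = random.Random(0)
dense = [rng.gauss(0.0, 1.0) for _ in range(10_000)]   # stand-in for a spread-out gradient
quasi_sparse = make_gradient(10_000, 100, rng)         # stand-in for a concentrated gradient

for name, grad in [("dense", dense), ("quasi-sparse", quasi_sparse)]:
    idx = top_fraction_indices(grad, 0.01)
    print(name, round(norm_share(grad, idx), 3))
```

In this toy setting the top 1% of the concentrated gradient carries nearly the whole norm, while the same fraction of the spread-out gradient carries far less. That gap is the property a top-$x\%$ update rule relies on.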

Abstract

With the prevalence of the pre-training-fine-tuning paradigm, how to efficiently adapt a pre-trained model to downstream tasks has become an intriguing issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and been widely applied, its underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of the prior distribution, which leads to a tighter bound on the generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and quasi-sparsity in the gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.
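For orientation, one commonly cited form of the PAC-Bayesian bound (McAllester's bound with Maurer's constant) is sketched below. The paper's own statement may differ in constants and notation. The symbols $P$ (prior), $Q$ (posterior), $R$ (expected risk), and $\hat{R}$ (empirical risk) are assumed here and do not come from the abstract.

```latex
% Assumptions: loss bounded in [0,1]; prior P fixed before seeing the data; n >= 8 i.i.d. samples.
% With probability at least 1 - \delta, simultaneously for all posteriors Q:
\mathbb{E}_{h \sim Q}\big[R(h)\big]
  \;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{R}(h)\big]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

In a bound of this shape, the only term that depends on the prior is $\mathrm{KL}(Q \,\|\, P)$. A prior that already sits close to the posterior therefore tightens the bound without any change to the data or the sample size.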

Paper Structure (22 sections, 3 theorems, 16 equations, 9 figures, 7 tables)

Introduction
Related Work
Understanding Pre-training-fine-tuning from a Distribution-shift Perspective
PAC-Bayesian Generalization Error Bounds
Visualization of Loss Landscape
Quasi-Sparse Gradient Distribution
Methodology
SIFT: Sparse Increment Fine-Tuning
A Memory-efficient Implementation of SIFT
Experiments
GLUE Benchmark
Instruction-tuning
Further Analysis
SIFT Vs. Random
Sparsity Rate Analysis
...and 7 more sections

Key Result

Theorem 1.1

If $\mathcal{H}$ is a finite hypothesis space, then for any hypothesis $h \in \mathcal{H}$, any loss function $l$ bounded in $[0,1]$, and any $0 < \delta < 1$, with probability at least $1-\delta$ over the selection of $n$ i.i.d. samples, [...]
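The inequality that completes this statement is missing here. As a point of reference only, the textbook bound that holds under exactly these conditions (Hoeffding's inequality plus a union bound over $\mathcal{H}$) is sketched below. Whether the paper states its theorem in this form is an assumption, and $R$ and $\hat{R}$ (expected and empirical risk) are notation introduced here.

```latex
% Standard finite-hypothesis-space bound (assumed form, not quoted from the paper):
\forall h \in \mathcal{H}: \quad
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{1}{\delta}}{2n}}
```

The $\ln|\mathcal{H}|$ term equals the KL divergence between a point mass on any single hypothesis and the uniform distribution over $\mathcal{H}$. This is why a uniform, "random-initialization" prior corresponds to the loosest version of such a bound.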

Figures (9)

Figure 1: An intuitive explanation of the prior and posterior distributions. Random initialization is equivalent to assigning equal probabilities to all hypotheses in the hypothesis space (the horizontal line, referred to as prior-random). Pre-training learns language features from extensive corpora and moves away from hypotheses that do not accurately express language, which is equivalent to assigning them lower prior probabilities. The data-based, more precise posterior also assigns lower probabilities to hypotheses that fail to represent and understand language correctly. Therefore, compared with prior-random, the KL divergence between prior-pretrained and the posterior is smaller.
Figure 2: 1-D (top) and 2-D (bottom) visualizations of the shift in the loss landscape from random initialization to pre-trained initialization. The loss landscape transitions from low-amplitude oscillations to high-amplitude oscillations.
Figure 3: (a) depicts the gradient distribution of the parameters in one layer of RoBERTa-Large on the MNLI dataset, for a model trained from scratch and for a pre-trained model. Both distributions are approximately bell-shaped, similar to a normal distribution with zero mean. (b) and (c) are drawn on their respective scales, showing that, compared with the model trained from scratch, the gradients of the pre-trained model are more concentrated around zero while a few gradient values are much larger. (d) illustrates the gradient proportion of the top x% components, indicating that the gradients of the pre-trained model are more extreme: only 1% of the parameters account for 99% of the complete gradient norm.
Figure 4: (Left) compares the proportion of the gradient captured by two top-1% component selection methods: the top 1% of the current batch and the top 1% of the first batch. Although sticking with the first batch's top 1% rather than reselecting the top 1% for the current batch loses some gradient information, the difference remains within an acceptable range. Avoiding frequent changes of components can improve training efficiency and preserve the historical information of Adam-like optimizers. In contrast, (Right) shows the variation in gradient proportion for the model trained from scratch, where fixing the top 1% of the first batch leads to a greater loss of gradient information.
Figure 5: (1) For each parameter (P) that is to be updated sparsely, we register a Sparse Parameter (SP) with a Sparse Gradient (SG), storing them as values and indices; (2) acquire the computed gradient (G) by inserting a hook function; (3) obtain a subset of the gradient's components through indexing to serve as the Sparse Gradient for the Sparse Parameter; (4) use the Sparse Gradient to update the Sparse Parameter in the optimizer; (5) use the Sparse Parameter to update the initial parameter.
...and 4 more figures
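The five steps in the Figure 5 caption can be sketched in plain Python as follows. This is a framework-free illustration under stated assumptions, not the authors' implementation: the class `SparseIncrement`, the helper functions, the quadratic toy loss, and the use of plain SGD in place of an Adam-like optimizer are all hypothetical choices made for brevity. The component indices are fixed from the first gradient, as the Figure 4 caption describes.

```python
class SparseIncrement:
    """Sparse increment for one flat parameter, stored as (indices, values)."""

    def __init__(self, indices):
        # Step 1: register the sparse parameter and its sparse gradient.
        self.indices = list(indices)
        self.values = [0.0] * len(self.indices)
        self.sparse_grad = [0.0] * len(self.indices)

    def hook(self, grad):
        # Steps 2-3: receive the dense gradient and keep only the indexed components.
        self.sparse_grad = [grad[i] for i in self.indices]

    def step(self, param, lr):
        for slot, i in enumerate(self.indices):
            update = -lr * self.sparse_grad[slot]
            self.values[slot] += update  # Step 4: optimizer update (plain SGD, an assumption)
            param[i] += update           # Step 5: apply the increment to the original parameter


def top_k_indices(grad, k):
    """Indices of the k largest-magnitude gradient components."""
    return sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]


def toy_gradient(param, target):
    """Gradient of the toy loss 0.5 * sum((p - t)^2)."""
    return [p - t for p, t in zip(param, target)]


param = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
target = [3.0, 0.01, -2.0, 0.0, 0.02, -0.01]
initial = list(param)

# Fix the trainable components from the first gradient, then train only those.
inc = SparseIncrement(top_k_indices(toy_gradient(param, target), k=2))
for _ in range(50):
    inc.hook(toy_gradient(param, target))
    inc.step(param, lr=0.2)

print(inc.indices, [round(p, 3) for p in param])
```

Only the stored `values` and `indices` would need optimizer state in a real setting. The dense gradient is read once inside the hook and then discarded, which is where the memory saving in such a scheme would come from.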

Theorems & Definitions (5)

Theorem 1.1
proof
Theorem 1.2
proof
Theorem 1.3
