Scaling Sparse Fine-Tuning to Large Language Models

Alan Ansell; Ivan Vulić; Hannah Sterz; Anna Korhonen; Edoardo M. Ponti

TL;DR

This work tackles the high memory cost of fine-tuning very large language models by introducing SpIEL, a memory-efficient sparse fine-tuning framework that maintains a small active set of parameter indices and their deltas. It presents two growth strategies, SpIEL-AG and SpIEL-MA, which decide which weights to grow using either accumulated gradients or approximate momenta estimated with the SM3 optimizer, while iteratively updating the active deltas and pruning indices according to how their deltas have changed. The results show that SpIEL-AG often surpasses LoRA and (IA)$^3$ across multiple model sizes and data mixtures, with SpIEL-MA offering a more memory-efficient variant; the approach remains effective under 4-bit quantization and is compatible with efficient optimizers. Overall, SpIEL scales sparse fine-tuning to LLMs like LLaMA 2 7B and 13B, providing a practical path toward adapting very large models to diverse tasks with reduced memory demands.

Abstract

Large Language Models (LLMs) are difficult to fully fine-tune (e.g., with instructions or human feedback) due to their sheer number of parameters. A family of parameter-efficient sparse fine-tuning methods has proven promising in terms of performance, but their memory requirements increase proportionally to the size of the LLMs. In this work, we scale sparse fine-tuning to state-of-the-art LLMs like LLaMA 2 7B and 13B. We propose SpIEL, a novel sparse fine-tuning method which, for a desired density level, maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values. It iterates over: (a) updating the active deltas, (b) pruning indices (based on the change of magnitude of their deltas) and (c) regrowing indices. For regrowth, we explore two criteria based on either the accumulated gradients of a few candidate parameters or their approximate momenta estimated using the efficient SM3 optimizer. We experiment with instruction-tuning of LLMs on standard dataset mixtures, finding that SpIEL is often superior to popular parameter-efficient fine-tuning methods like LoRA (low-rank adaptation) in terms of performance and comparable in terms of run time. We additionally show that SpIEL is compatible with both quantization and efficient optimizers, to facilitate scaling to ever-larger model sizes. We release the code for SpIEL at https://github.com/AlanAnsell/peft and for the instruction-tuning experiments at https://github.com/ducdauge/sft-llm.
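The update, prune, and regrow cycle described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the released SpIEL implementation: the function names (`sparse_finetune`, `loss`), the quadratic loss, the plain SGD update, and all hyperparameter values are assumptions made for illustration. The pruning rule here reads "change of magnitude" as the absolute change of each delta over the preceding update phase, and the regrowth rule ranks every inactive parameter by its current gradient magnitude, whereas the abstract describes accumulating gradients for only a few candidates.

```python
import random


def loss(theta0, target, active):
    """Toy quadratic loss 0.5 * sum((theta0 + delta - target)^2)."""
    return 0.5 * sum(
        (theta0[i] + active.get(i, 0.0) - target[i]) ** 2
        for i in range(len(theta0))
    )


def sparse_finetune(theta0, target, density=0.25, rounds=5,
                    steps=20, lr=0.2, swap=1, seed=0):
    """Hypothetical sketch of an update / prune / regrow loop.

    `active` maps a parameter index to its delta relative to the
    frozen pretrained value theta0[index]; all other deltas are zero.
    """
    rng = random.Random(seed)
    n = len(theta0)
    k = max(1, int(density * n))          # size of the active set
    active = {i: 0.0 for i in rng.sample(range(n), k)}

    def grad(i):
        # Gradient of the toy loss with respect to parameter i.
        return theta0[i] + active.get(i, 0.0) - target[i]

    for r in range(rounds):
        # (a) Update only the active deltas for `steps` SGD steps.
        before = dict(active)
        for _ in range(steps):
            for i in active:
                active[i] -= lr * grad(i)
        if r == rounds - 1:
            break                          # finish on an update phase

        # (b) Prune the `swap` indices whose deltas changed least
        #     during this phase (assumed reading of the criterion).
        by_change = sorted(active, key=lambda i: abs(active[i] - before[i]))
        for i in by_change[:swap]:
            del active[i]

        # (c) Regrow the `swap` inactive indices with the largest
        #     gradient magnitude; new deltas start at zero.
        inactive = [i for i in range(n) if i not in active]
        inactive.sort(key=lambda i: abs(grad(i)), reverse=True)
        for i in inactive[:swap]:
            active[i] = 0.0
    return active


theta0 = [0.0] * 8
target = [5.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0]
active = sparse_finetune(theta0, target)
print(sorted(active), loss(theta0, target, active))
```

The active set keeps a fixed size throughout, so memory for deltas and optimizer state depends on the chosen density rather than on the full parameter count. In this toy, a pruned index loses its delta and may be regrown from zero in the same round; how the released code handles that case is not stated in the abstract.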

Paper Structure (24 sections, 15 equations, 4 figures, 5 tables, 2 algorithms)


Introduction
Background and Related Work
Parameter-Efficient and Memory-Efficient Fine-Tuning
LoRA
Sparse Fine-Tuning
Quantized PEFT
Method
Efficient SFT with Fixed ${\bm{\eta}}$
SpIEL-AG: Accumulated Gradient SpIEL
SpIEL-MA: Momentum-Approximation SpIEL
Experimental Setup
Training and Evaluation Data
Models and Baselines
Results
Main Results
...and 9 more sections

Figures (4)

Figure 1: A visualization of the proposed Sparse Fine-Tuning (SFT) method scaled to a Large Language Model (LLM). PEFT parameters consist of indices (arrows) and corresponding deltas (red squares) with respect to LLM parameters (blue squares). After initialization (1), PEFT deltas are updated for $S$ steps (2). Next, obsolete indices are dropped (3) and new indices are grown (4) according to either accumulated gradients or approximate momenta. The algorithm then returns to the update step (2) and is repeated iteratively.
Figure 2: Proportion of indices with a certain age (i.e., the iteration when they were last grown) of the converged ${\bm{\eta}}$ after training on the Flan v2 dataset.
Figure 3: SFT (lower right) versus LoRA (upper right) applied to a linear layer (the output projection of self-attention) of a Transformer block.
Figure 4: Hyperparameter search results.
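The contrast drawn in Figure 3 between SFT and LoRA on a single linear layer can be sketched in a few lines. This is an illustrative sketch only: the helper names (`matvec`, `lora_forward`, `sft_forward`) and the tiny matrices are assumptions, and it omits details such as LoRA's scaling factor. LoRA adds a low-rank product `B @ A` to the frozen weight, while SFT adds a handful of individual deltas at stored (row, column) indices.

```python
def matvec(W, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lora_forward(W, A, B, x):
    """y = W x + B (A x), with A of shape r x d_in and B of shape d_out x r.

    Trainable parameters: r * (d_in + d_out).
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]


def sft_forward(W, indices, deltas, x):
    """y = (W + sparse delta) x without building the dense delta matrix.

    Trainable parameters: len(deltas), plus storage for the indices.
    """
    y = matvec(W, x)
    for (row, col), d in zip(indices, deltas):
        y[row] += d * x[col]
    return y


W = [[1.0, 0.0], [0.0, 1.0]]          # frozen pretrained weight
x = [2.0, 3.0]

A = [[1.0, 1.0]]                      # rank r = 1
B = [[1.0], [0.0]]
print(lora_forward(W, A, B, x))       # [7.0, 3.0]

indices = [(0, 1)]                    # one active (row, col) position
deltas = [0.5]
print(sft_forward(W, indices, deltas, x))  # [3.5, 3.0]
```

In both cases the pretrained weight `W` stays untouched; only the small added structure would be trained.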
