Sparse Gradient Compression for Fine-Tuning Large Language Models

David H. Yang, Mohammad Mohammadi Amiri, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen

TL;DR

This work tackles the memory bottleneck in fine-tuning large language models by introducing Sparse Gradient Compression (SGC), a framework that updates AdamW optimizer states in a $k$-dimensional subspace independent of the parameter size $d$. By sparsifying gradients to $s$ non-zeros and projecting to a subspace with a fixed projection ${\bm{A}}$, SGC operates in $\mathbb{R}^k$ and uses compressed sensing with Orthogonal Matching Pursuit (OMP) to recover full-dimension updates, yielding memory-efficient yet competitive performance. The authors further propose two efficient variants, MESGC (gradient chunking) and CESGC (double compression with SVD-based projection), and provide a memory analysis showing substantial reductions in optimizer-state storage compared to GaLore and LoRA. Empirical results on LLaMA and Mistral models across commonsense, knowledge, and small-data tasks demonstrate that SGC achieves comparable or superior accuracy with far fewer optimizer-state parameters, particularly under data-limited and memory-limited scenarios. This approach offers a flexible, granular memory-performance tradeoff for PEFT and opens avenues for applying gradient-compression techniques beyond language models.

Abstract

Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models. However, the high memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size. To address this, parameter-efficient fine-tuning (PEFT) methods have been proposed to minimize the number of parameters required for fine-tuning LLMs. These approaches, however, often tie the number of optimizer states to the dimensions of the model parameters, limiting flexibility and control during fine-tuning. In this paper, we propose sparse gradient compression (SGC), a training regime designed to address these limitations. Our approach leverages inherent sparsity in gradients to compress optimizer states by projecting them onto a low-dimensional subspace, with dimensionality independent of the original model's parameters. By enabling optimizer state updates in an arbitrary low-dimensional subspace, SGC offers a flexible tradeoff between memory efficiency and performance. We demonstrate through experiments that SGC can decrease memory usage in optimizer states more effectively than existing PEFT methods. Furthermore, by fine-tuning LLMs on various downstream tasks, we show that SGC can deliver superior performance while substantially lowering optimizer state memory requirements, particularly in both data-limited and memory-limited settings.
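
The pipeline summarized above (sparsify the gradient to $s$ non-zeros, project with a fixed matrix into $\mathbb{R}^k$, keep the Adam moments there, recover a full-dimension update with OMP) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Gaussian choice for the projection, the toy sizes, and the ordering of the steps inside `sgc_adam_step` are assumptions made for illustration, and weight decay is omitted.

```python
import numpy as np

def top_s_sparsify(g, s):
    """Keep the s largest-magnitude entries of g and zero the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -s)[-s:]
    out[idx] = g[idx]
    return out

def omp(A, y, s):
    """Orthogonal Matching Pursuit: find an s-sparse x with A @ x close to y."""
    d = A.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(s):
        # Pick the column most correlated with the current residual.
        corr = np.abs(A.T @ residual)
        if support:
            corr[support] = 0.0  # do not pick the same column twice
        support.append(int(np.argmax(corr)))
        # Least-squares fit of y on the columns selected so far.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(d)
    x[support] = coef
    return x

def sgc_adam_step(g, A, m, v, t, s, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative step; the moments m and v live in R^k, not R^d."""
    p = A @ top_s_sparsify(g, s)          # compressed gradient in R^k
    m = beta1 * m + (1 - beta1) * p       # first moment, size k
    v = beta2 * v + (1 - beta2) * p**2    # second moment, size k
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    u_k = m_hat / (np.sqrt(v_hat) + eps)  # Adam-style update in the subspace
    u_d = omp(A, u_k, s)                  # approximate s-sparse update in R^d
    return u_d, m, v

rng = np.random.default_rng(0)
d, k, s = 200, 100, 4                     # toy sizes (assumed)
A = rng.standard_normal((k, d)) / np.sqrt(k)  # fixed projection (assumed Gaussian)
g = rng.standard_normal(d)                # stand-in for a flattened gradient
g_sparse = top_s_sparsify(g, s)

m, v = np.zeros(k), np.zeros(k)
update, m, v = sgc_adam_step(g, A, m, v, t=1, s=s)
print("optimizer-state entries:", m.size + v.size, "vs full Adam:", 2 * d)
print("non-zeros in recovered update:", np.count_nonzero(update))
```

The point of the sketch is that `m` and `v` have length `k` regardless of `d`, which is the sense in which the optimizer-state size is decoupled from the parameter dimensions.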

Paper Structure

This paper contains 25 sections, 1 theorem, 23 equations, 3 figures, 11 tables, 4 algorithms.

Key Result

Theorem 1

Let ${\bm{G}}$, $\tilde{{\bm{G}}}'$ and $\tilde{{\bm{G}}}$ be as in Definition 1 (chunk-based $s$-sparsification). Then, it holds that where $G_{\text{max}}$ is an upper bound on $\mathbb{E}\bigl[\|\tilde{{\bm{G}}}'\|_2^2\bigr]$.

Figures (3)

  • Figure 1: Diagram comparing SGC (green) and PEFT methods LoRA and GaLore (blue) in terms of the dimension of optimizer states compared to full fine-tuning. SGC enables a lower minimum and finer granularity for the number of optimizer states since it is independent of parameter dimensions.
  • Figure 2: (a) CESGC outperforms both GaLore and LoRA when fine-tuning with limited data on BoolQ. (b) Accuracy improvement of CESGC when using a minimal number of optimizer states. Hollow blue points are interpolated values that indicate the granularity of CESGC across optimizer states.
  • Figure 3: Ablation study for the effects of the number of chunks $c$, sparsity $s$, and constant $\kappa$. (a) Average accuracy with varying $c$ and constant $s$. (b) Average accuracy with varying $s$ and constant $c$. (c) Average accuracy with varying $\kappa$.

Theorems & Definitions (3)

  • Definition 1: Chunk-based $s$-sparsification
  • Theorem 1: Worst-case bound on chunk-based vs. global sparsification
  • proof
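
The definition and theorem listed above contrast chunk-based with global sparsification, but this page does not reproduce the definition. The sketch below therefore rests on an assumed reading: the gradient is split into $c$ equal chunks and the $s/c$ largest-magnitude entries are kept in each. The function names and sizes are hypothetical. It checks only the elementary fact that global top-$s$ selection gives the smallest $\ell_2$ error among all $s$-sparse approximations, so the chunked error can never be lower; it does not reproduce the bound of Theorem 1.

```python
import numpy as np

def top_s(g, s):
    """Global sparsification: keep the s largest-magnitude entries of g."""
    out = np.zeros_like(g)
    if s > 0:
        idx = np.argpartition(np.abs(g), -s)[-s:]
        out[idx] = g[idx]
    return out

def chunk_top_s(g, s, c):
    """Assumed chunk-based variant: keep the top s // c entries in each of c chunks."""
    assert g.size % c == 0 and s % c == 0
    chunks = g.reshape(c, -1)
    return np.concatenate([top_s(row, s // c) for row in chunks])

rng = np.random.default_rng(0)
g = rng.standard_normal(1024)   # stand-in for a flattened gradient
s, c = 64, 8                    # toy sparsity and chunk count (assumed)

g_global = top_s(g, s)
g_chunk = chunk_top_s(g, s, c)

err_global = np.linalg.norm(g - g_global)
err_chunk = np.linalg.norm(g - g_chunk)
print("global top-s error: ", err_global)
print("chunked top-s error:", err_chunk)
```

Under this reading, chunking trades some approximation quality for the ability to process each chunk independently with a smaller working set.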