Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity

Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu

TL;DR

This work tackles memory constraints in fine-tuning large language models by combining zeroth-order (ZO) optimization with extreme sparsity and quantization. It identifies a Fisher-informed sparse pattern that transfers from pre-training data to downstream tasks: updating only ~0.1% of parameters is enough to match or exceed full ZO fine-tuning, while quantizing the remaining parameters to 4 bits enables on-device training on GPUs with less than 8 GiB of memory. The authors provide theoretical convergence guarantees for sparse ZO-SGD and report strong empirical results across multiple 7B-scale models and diverse tasks, including wall-clock speedups and practical on-device personalization. The approach offers a scalable path to personalized LLMs on edge devices without degrading task performance relative to full ZO fine-tuning, with implications for privacy-preserving deployment and responsive user experiences.
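
As a rough sanity check on the 8 GiB figure, the weight-only footprint can be estimated with simple arithmetic. The sketch below assumes a round 7 billion parameters, 16-bit storage for the tunable 0.1%, and ignores quantization metadata (scales, zero points), sparse index storage, and activations; all of these are illustrative assumptions rather than numbers from the paper.

```python
GIB = 1024 ** 3  # bytes per GiB

def weight_memory_gib(n_params, sparse_fraction=0.001, dense_bits=4, sparse_bits=16):
    """Weight-only estimate: quantized frozen part plus higher-precision tunable part.

    All defaults are illustrative assumptions, not values taken from the paper.
    """
    n_sparse = round(n_params * sparse_fraction)   # tunable "sensitive" parameters
    n_dense = n_params - n_sparse                  # frozen, quantized parameters
    dense_gib = n_dense * dense_bits / 8 / GIB
    sparse_gib = n_sparse * sparse_bits / 8 / GIB
    return dense_gib, sparse_gib

N_PARAMS = 7_000_000_000                           # assumed round parameter count
dense_gib, sparse_gib = weight_memory_gib(N_PARAMS)
full_fp16_gib = N_PARAMS * 16 / 8 / GIB            # all weights kept in 16-bit

print(f"4-bit frozen weights : {dense_gib:.2f} GiB")
print(f"16-bit tunable 0.1%  : {sparse_gib:.3f} GiB")
print(f"all weights in 16-bit: {full_fp16_gib:.2f} GiB")
```

Under these assumptions the quantized weights plus the tunable subset fit well under 8 GiB, whereas keeping every weight in 16-bit does not, which is consistent with the motivation for quantizing the un-tuned parameters.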

Abstract

Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models using only forward passes. However, applying ZO fine-tuning in memory-constrained settings such as mobile phones and laptops is still challenging, since full-precision forward passes are infeasible. In this study, we address this limitation by integrating sparsity and quantization into ZO fine-tuning of LLMs. Specifically, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO. This approach allows the majority of un-tuned parameters to be quantized to accommodate the constraint of limited device memory. Our findings reveal that the pre-training process can identify a set of "sensitive parameters" that can guide the ZO fine-tuning of LLMs on downstream tasks. Our results demonstrate that fine-tuning only these sensitive parameters, 0.1% of the LLM, with ZO can outperform full ZO fine-tuning while offering a wall-clock time speedup. Additionally, we show that ZO fine-tuning targeting these 0.1% sensitive parameters, combined with 4-bit quantization, enables efficient ZO fine-tuning of a Llama2-7B model on a GPU device with less than 8 GiB of memory and notably reduced latency.
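
To make the idea of forward-pass-only updates on a small parameter subset concrete, here is a minimal NumPy sketch on a toy quadratic loss. It is not the authors' implementation: the function names, the toy loss, and the use of the exact toy gradient to pick the subset are all assumptions made for illustration, and the step size `1 / (k + 2)` simply mirrors the form of the step size in Theorem 1 under the assumption that `k` counts the selected coordinates and `L = 1` for this loss.

```python
import numpy as np

def select_sensitive_indices(grad, k):
    """Indices of the k coordinates with the largest squared gradient."""
    return np.argsort(grad ** 2)[-k:]

def sparse_zo_sgd_step(loss_fn, w, idx, lr, eps, rng):
    """One two-point (SPSA-style) step that perturbs and updates only w[idx]."""
    z = rng.standard_normal(idx.size)          # random direction on the sparse subset
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[idx] += eps * z
    w_minus[idx] -= eps * z
    # Two forward evaluations give a scalar directional-derivative estimate.
    proj_grad = (loss_fn(w_plus) - loss_fn(w_minus)) / (2.0 * eps)
    w_next = w.copy()
    w_next[idx] -= lr * proj_grad * z          # all other coordinates stay frozen
    return w_next

def run_demo(d=1000, k=10, steps=500, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    target = rng.standard_normal(d)
    # Toy quadratic loss (assumption): gradient is w - target, smoothness L = 1.
    loss_fn = lambda v: 0.5 * float(np.sum((v - target) ** 2))
    w0 = np.zeros(d)
    idx = select_sensitive_indices(w0 - target, k)  # fixed before "fine-tuning"
    w = w0
    for _ in range(steps):
        w = sparse_zo_sgd_step(loss_fn, w, idx, lr=1.0 / (k + 2), eps=eps, rng=rng)
    return w0, w, idx, loss_fn

w0, w, idx, loss_fn = run_demo()
print(f"loss before: {loss_fn(w0):.3f}, after: {loss_fn(w):.3f}, updated coords: {idx.size}")
```

Because only `w[idx]` is ever perturbed or written, every other coordinate could in principle be stored in a read-only, quantized form, which is the property the abstract relies on.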

Paper Structure (28 sections, 3 theorems, 25 equations, 14 figures, 6 tables)

Key Result

Theorem 1

If we pick $\eta_t = 1 / (L (k + 2))$, then under the assumptions of bounded gradient error, Lipschitz smoothness, and sparse sensitive parameters, sensitive sparse ZO-SGD satisfies a first convergence bound (the Lipschitz-smooth case). Moreover, with the same step size $\eta_t = 1 / (L (k + 2))$ and the additional Polyak-Łojasiewicz (P.L.) condition, a second convergence bound holds (the P.L. case). The displayed bounds are not reproduced in this summary.

Figures (14)

  • Figure 1: Training & inference speed of Llama2-7B. As the sensitive sparse fine-tuning method achieves great performance via optimizing only 0.1% parameters (performance comparable to ZO full fine-tuning and 10% random subsets), during inference we achieve an end-to-end $1.49\times$ speedup, with $2.15\times$ speedup at sparse operations.
  • Figure 2: Cumulative normalized sum of coordinate-wise squared gradients $[\nabla \mathcal{F}(\mathbf{w})]_i^2$ of linear layers during Llama2-7B (Touvron et al., 2023) fine-tuning. For each linear layer, we first sort parameters in decreasing order of their squared gradient value $[\nabla \mathcal{F}(\mathbf{w})]_i^2, i \in [d_\text{layer}]$, then take the cumulative sum and normalize it to draw a blue curve; the red-shaded region is the mean $\pm$ std of all blue curves. More similar figures are in the appendix. We observe that roughly 0.1% of the parameters in all linear layers contribute about 50% of the squared gradient norm.
  • Figure 3: Cumulative normalized squared gradient values of the Llama2-7B model's linear layers during fine-tuning. For each line, the color represents the fraction of parameters and the line style represents the category. "task grad, dyn." refers to the sensitive parameters selected at the given timestep (x-axis), and "task grad, static" refers to the sensitive parameters selected before fine-tuning. "C4 grad, static" refers to the sensitive parameters selected with gradients taken from causal language modeling on the C4 dataset (Raffel et al., 2019); this selection is kept unchanged during fine-tuning. More similar figures are in the appendix.
  • Figure 4: On-device LLM personalization workflow via integrating sensitive sparse ZO optimization with quantization.
  • Figure 5: Optimizing sensitive parameters with C4 gradients versus optimizing weights with largest magnitude (weight outliers) and random subsets of weights. The trainable parameters are all determined before fine-tuning and other parameters are kept unchanged.
  • ...and 9 more figures
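
The curve described in the Figure 2 caption (sort squared gradients in decreasing order, take the cumulative sum, normalize) can be sketched in a few lines of NumPy. The function names and the heavy-tailed synthetic gradient below are assumptions for illustration only; they are not the paper's code or data.

```python
import numpy as np

def cumulative_grad_sq_curve(grad):
    """Normalized cumulative sum of squared gradients, sorted in decreasing order."""
    g2 = np.sort(np.ravel(grad) ** 2)[::-1]
    return np.cumsum(g2) / np.sum(g2)

def mass_of_top_fraction(grad, fraction):
    """Share of the squared gradient norm carried by the top `fraction` of coordinates."""
    curve = cumulative_grad_sq_curve(grad)
    k = max(1, int(round(fraction * curve.size)))
    return float(curve[k - 1])

# Synthetic stand-in for one linear layer's gradient (heavy-tailed, purely illustrative).
rng = np.random.default_rng(0)
fake_layer_grad = rng.standard_t(df=2, size=(256, 256))
share = mass_of_top_fraction(fake_layer_grad, 0.001)
print(f"top 0.1% of coordinates carry {share:.1%} of the squared gradient norm")
```

Repeating this per linear layer and averaging the curves would give the mean and standard deviation band described in the caption.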

Theorems & Definitions (9)

  • Definition 1: Simultaneous Perturbation Stochastic Approximation (SPSA) (Spall, 1992)
  • Definition 2: ZO-SGD update rule
  • Definition 3: Sensitive parameter mask
  • Definition 4: Sensitive sparse ZO-SGD update rule
  • Theorem 1: Convergence rate of sensitive sparse ZO-SGD (Definition 4)
  • Lemma 1: Sparse ZO surrogate gradient covariance
  • Proof: Proof of the Lipschitz-smoothness bound in Theorem 1
  • Lemma 2: Sparse ZO surrogate gradient norm
  • Proof: Proof of the P.L.-condition bound in Theorem 1
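
For orientation, the objects named in Definitions 1 to 4 are commonly written as below. This is a sketch in standard notation, assuming a minibatch loss $f$, perturbation scale $\epsilon$, Gaussian direction $\mathbf{z}$, and a binary mask $\mathbf{m}_k$ with $k$ nonzero entries; it is not a transcription of the paper's equations, whose exact form may differ.

```latex
\begin{align*}
% SPSA two-point gradient estimate along a random direction z
\hat{g}(\mathbf{w}, \mathbf{z}) &= \frac{f(\mathbf{w} + \epsilon \mathbf{z}) - f(\mathbf{w} - \epsilon \mathbf{z})}{2\epsilon}\, \mathbf{z},
  \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d) \\
% ZO-SGD update
\mathbf{w}_{t+1} &= \mathbf{w}_t - \eta_t\, \hat{g}(\mathbf{w}_t, \mathbf{z}_t) \\
% Mask selecting the k coordinates with the largest squared gradient
\mathbf{m}_k &= \operatorname*{arg\,max}_{\mathbf{m} \in \{0,1\}^d,\ \|\mathbf{m}\|_0 = k}
  \left\| \mathbf{m} \odot \nabla \mathcal{F}(\mathbf{w}) \right\|_2^2 \\
% Sparse ZO-SGD: perturb and update only the masked coordinates
\mathbf{w}_{t+1} &= \mathbf{w}_t - \eta_t\, \hat{g}(\mathbf{w}_t, \mathbf{m}_k \odot \mathbf{z}_t)
\end{align*}
```

In the last line the mask enters through the perturbation direction, so coordinates outside the mask are neither perturbed nor updated.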