Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, Zaiwen Wen

TL;DR

This paper proposes a low-rank ZO gradient estimator and a novel low-rank ZO algorithm (LOZO) that captures the low-rank gradient structure common in LLM fine-tuning. It also provides convergence guarantees for LOZO by framing it as a subspace optimization method.

Abstract

Parameter-efficient fine-tuning (PEFT) significantly reduces memory costs when adapting large language models (LLMs) for downstream applications. However, traditional first-order (FO) fine-tuning algorithms incur substantial memory overhead due to the need to store activation values for back-propagation during gradient computation, particularly in long-context fine-tuning tasks. Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values, thus eliminating the need for activation storage. Nevertheless, existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance. This paper proposes a low-rank ZO gradient estimator and introduces a novel low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs. We provide convergence guarantees for LOZO by framing it as a subspace optimization method. Additionally, its low-rank nature enables LOZO to integrate with momentum techniques while incurring negligible extra memory costs. Extensive experiments across various model sizes and downstream tasks demonstrate that LOZO and its momentum-based variant outperform existing ZO methods and closely approach the performance of FO algorithms.
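As an illustration of the idea described in the abstract, the following is a minimal sketch of a two-point finite-difference gradient estimate taken along a low-rank perturbation. It is not the paper's implementation. The factorized direction `U @ V.T`, the Gaussian sampling, the division by the rank `r`, and the names `lowrank_zo_grad` and `toy_loss` are all assumptions made for this example; this summary does not state the exact estimator or its scaling.

```python
import numpy as np

def lowrank_zo_grad(f, X, U, V, eps=1e-3):
    """Two-point finite-difference estimate along the direction U @ V.T.

    Only function values of f are used, so no back-propagation
    (and no stored activations) is required.
    """
    Z = U @ V.T                                   # rank-r perturbation (assumed form)
    slope = (f(X + eps * Z) - f(X - eps * Z)) / (2.0 * eps)
    return slope * Z / U.shape[1]                 # 1/r scaling is an assumption

def toy_loss(X):
    # Hypothetical stand-in for a fine-tuning loss: a simple quadratic.
    return 0.5 * np.sum(X ** 2)

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
X = rng.standard_normal((m, n))                   # one weight matrix
U = rng.standard_normal((m, r))                   # left factor
V = rng.standard_normal((n, r))                   # right factor
G = lowrank_zo_grad(toy_loss, X, U, V)            # estimate with rank at most r
```

The estimate is a scalar multiple of `U @ V.T`, so it can be represented by the two factors and one scalar (`m*r + n*r + 1` numbers) rather than a dense `m × n` matrix. That is one plausible reading of why momentum can be added at little extra memory cost.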

Paper Structure

This paper contains 25 sections, 8 theorems, 56 equations, 4 figures, 7 tables, 2 algorithms.

Key Result

Theorem 4.4

Under Assumptions ass-smooth and ass-orthogonal, and letting $T = K\nu$, with suitable choices of $\alpha$ and $\epsilon$, the sequence of the $k\nu$-th variables $\{{\boldsymbol{X}}^{k\nu}\}$ generated by LOZO converges at the following rate: where $\Delta_0 := f({\boldsymbol{X}}^0) - f^*$, $\tilde{d} =\sum_{\ell=1}^{\mathcal{L}} (m_\ell n_\ell^2/r_\ell)$ and $d = \sum_{\ell=1}^{\mathcal{L}} m_\

Figures (4)

  • Figure 1: The low-rank structure of the gradients encountered in the fine-tuning of LLMs, demonstrated using the OPT-1.3B model with the COPA dataset, where the gradient matrices have dimensions of $2048 \times 2048$. In both panels, we report only the 100 largest singular values. Left: Singular value distribution of the gradient of the attention $Q$ matrix in layer 10 across different training steps. Right: Singular value distribution of the gradient of the attention $V$ matrix across different layers at training step 50.
  • Figure 2: The figures illustrate the performance of different algorithms on RoBERTa-large across three tasks (SNLI, MNLI, and RTE), with the left panel corresponding to $k=512$ and the right panel corresponding to $k=16$. Detailed numerical results are provided in Table \ref{appen-tab-roberta}.
  • Figure 3: Left: Loss curves of OPT-13B on the SQuAD dataset. Middle: Loss curves of OPT-30B on the SST-2 dataset. Right: Loss curves of OPT-66B on the WIC dataset.
  • Figure 4: Left: Loss curves of OPT-1.3B on the RTE dataset for different ranks $r$. Right: Loss curves of OPT-1.3B on the SST-2 dataset for different values of $\nu$.
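
The kind of singular-value inspection described in the Figure 1 caption can be sketched as follows. This is only an illustration on a synthetic matrix: the matrix `grad`, its size, the noise level, and `true_rank` are hypothetical stand-ins, not gradients or values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_rank = 256, 4

# Synthetic stand-in for a gradient matrix: low-rank signal plus small noise.
A = rng.standard_normal((n, true_rank))
B = rng.standard_normal((n, true_rank))
grad = A @ B.T + 1e-3 * rng.standard_normal((n, n))

# Singular values are returned in descending order.
sv = np.linalg.svd(grad, compute_uv=False)
top = sv[:100]                                    # keep the 100 largest, as in Figure 1

# Fraction of squared Frobenius norm captured by the leading singular values.
energy = np.sum(top[:true_rank] ** 2) / np.sum(sv ** 2)
```

A sharp drop after the first few entries of `top`, together with `energy` close to 1, indicates that the matrix is approximately low-rank.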

Theorems & Definitions (16)

  • Remark 4.3
  • Theorem 4.4
  • Proposition A.1
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • Lemma B.3: Section 2 in nesterov2017random
  • proof
  • ...and 6 more