TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs

Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao

TL;DR

TeZO introduces a unified low-rank zeroth-order estimator that exploits both the intrinsic low-rank structure of per-iteration gradients and their similarity across time by modeling ZO perturbations as a 3D tensor and applying Canonical Polyadic Decomposition. The method reduces the cost of generating perturbations over $T$ iterations from $\mathcal{O}(\sqrt{d}T)$ to $\mathcal{O}(\sqrt{d}+T)$ and extends to memory-efficient variants for momentum and Adam optimizers. Theoretical analysis shows TeZO remains an unbiased gradient estimator with a convergence rate comparable to existing ZO methods, while experiments demonstrate substantial memory savings and competitive or superior performance on large-scale LLM fine-tuning tasks. These results suggest TeZO is a practical and scalable approach for efficient ZO-based fine-tuning of large language models.
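
The gap between the two generation costs can be illustrated with a simple count for a single $m\times n$ weight matrix. The sketch below is only an illustration under stated assumptions: the function name, the baseline that redraws both rank-$r$ factors at every step, and the example sizes are hypothetical and not taken from the paper. With $d = mn$ and $m \approx n \approx \sqrt{d}$, the two counts scale as $\sqrt{d}T$ and $\sqrt{d}+T$ for a fixed rank.

```python
def perturbation_samples(m: int, n: int, r: int, T: int) -> dict:
    """Count the random scalars drawn over T steps for one m x n weight matrix."""
    # Assumed baseline: redraw both rank-r factors (m x r and n x r) at every step.
    per_step_low_rank = T * r * (m + n)
    # Tensor (CPD-style) scheme: draw the two factor matrices once,
    # then only r temporal coefficients per step.
    tensor_cpd = r * (m + n) + T * r
    return {"per_step_low_rank": per_step_low_rank, "tensor_cpd": tensor_cpd}


# Hypothetical sizes chosen for illustration only.
counts = perturbation_samples(m=4096, n=4096, r=64, T=1000)
print(counts)
```

In this example the per-step baseline draws 524,288,000 scalars, while the tensor scheme draws 588,288.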

Abstract

Zeroth-order optimization (ZO) has demonstrated remarkable promise in efficient fine-tuning tasks for Large Language Models (LLMs). In particular, recent advances incorporate the low-rankness of gradients, introducing low-rank ZO estimators to further reduce GPU memory consumption. However, most existing works focus solely on the low-rankness of each individual gradient, overlooking a broader property shared by all gradients throughout training, i.e., all gradients approximately reside within a similar subspace. In this paper, we consider these two factors together and propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimensions. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix, significantly reducing the training cost. TeZO can also be easily extended to the Adam variant while consuming less memory than MeZO-SGD and requiring only about 35% of the memory of MeZO-Adam. Both comprehensive theoretical analysis and extensive experiments validate its efficiency, achieving SOTA-comparable results with lower time and memory overhead.
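
A minimal sketch of the idea described above, assuming Gaussian factors and a two-point finite-difference estimate: the factor matrices `U` and `V` are drawn once and shared by all iterations, and each step draws only a length-$r$ coefficient vector `tau`, so each perturbation is the rank-$r$ slice $Z_t = \sum_{s=1}^{r} \tau_{t,s}\, u_s v_s^\top$ of a CPD-structured 3D tensor. The toy quadratic loss, the sizes, the step size, and all variable names are hypothetical and are not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, T = 8, 6, 2, 5        # hypothetical sizes: weight is m x n, CPD rank r, T steps
rho, lr = 1e-3, 1e-2           # hypothetical perturbation rate and learning rate

# Factors shared across all iterations (drawn once).
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))


def perturbation(tau: np.ndarray) -> np.ndarray:
    """Return Z_t = U diag(tau) V^T, one rank-<=r slice of the CPD tensor."""
    return (U * tau) @ V.T


def loss(W: np.ndarray) -> float:
    """Toy objective standing in for the LLM loss (assumption)."""
    return 0.5 * float(np.sum(W ** 2))


W = rng.standard_normal((m, n))
for t in range(T):
    tau = rng.standard_normal(r)            # only r new random scalars per step
    Z = perturbation(tau)
    # Two-point zeroth-order estimate of the directional derivative along Z.
    g = (loss(W + rho * Z) - loss(W - rho * Z)) / (2 * rho)
    W = W - lr * g * Z                      # plain ZO-SGD update
```

The sketch materializes `Z` for readability; a memory-conscious implementation would more likely keep only the factors and `tau`.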

Paper Structure

This paper contains 24 sections, 3 theorems, 32 equations, 8 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

Without loss of generality, we consider the 2D parameters $W\in\mathbb{R}^{m\times n}$. Its FO gradient is denoted as $\nabla_{W} f$ and ZO gradient is denoted as $\nabla_{W}^0 f$. When using the TeZO method to estimate the ZO gradient with rank $r$ and a sufficiently small perturbation rate $\rho$ where $\delta = 1 + mn + \frac{2mn}{r} + \frac{6(m+n)}{r} + \frac{10}{r}$.
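
To get a feel for the size of the constant $\delta$ defined above, it can be evaluated directly from its formula. The helper below is a hypothetical illustration; the function name and the example values $m = n = 4096$, $r = 64$ are assumptions chosen for the arithmetic, not settings reported in the paper.

```python
def tezo_delta(m: int, n: int, r: int) -> float:
    """Evaluate delta = 1 + mn + 2mn/r + 6(m+n)/r + 10/r."""
    return 1 + m * n + 2 * m * n / r + 6 * (m + n) / r + 10 / r


# Hypothetical example values.
delta_example = tezo_delta(m=4096, n=4096, r=64)
print(delta_example)
```

For these values the $mn$ term (16,777,216) dominates, and the $\frac{2mn}{r}$ term (524,288) shrinks as the rank $r$ grows.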

Figures (8)

  • Figure 1: Panels (a) and (b) validate the low-rankness of gradients. We fine-tune OPT-1.3B on SST-2 and calculate the top-100 singular values of the gradients of layers.9.self_attn.out_proj.weight. We then concatenate these singular-value vectors and display them as a heat map in (a). Next, we concatenate the normalized gradient of each layer over a total of $T$ iterations into a matrix of size $d_l\times T$, calculate the top-100 singular values for each layer, and display them as a heat map in (b). In (c), we record the GPU memory usage of MeZO, our TeZO, and their corresponding variants when training the OPT-13B model. Further experiments on the low-rankness and the subspace of gradients on LLaMA-7B are provided in the appendix.
  • Figure 2: The ZO diagrams for LOZO, SubZO, and our TeZO method. LOZO and SubZO focus on estimating a single perturbation $Z_t$ as the product of low-rank matrices. TeZO constructs the entire perturbation set ${\bm Z} = \{Z_t\}$ via the CPD of a 3D tensor.
  • Figure 3: GPU memory usage (a) and wall-clock time (b) for fine-tuning LLMs on the RTE dataset on an H100. SubZO does not provide other memory-efficient extensions. LOZO does not provide the memory-efficient Adam extension. More details are given in the appendix.
  • Figure 4: Loss curves of LLaMA-7B on the SST-2 and RTE datasets with ZO-SGD and ZO-Adam methods. Curves are smoothed with the gaussian_filter1d function from the scipy.ndimage library with sigma=30.
  • Figure 5: We fine-tune LLaMA-7B on SST-2 to test the low-rankness of gradients. We set the batch size to 16 and train for 500 steps with 8000 samples on an H200 device. The training loss decreases from 1.04 to 0.13. We analyze the low-rank properties of the $W_K$, $W_V$, $W_Q$ and $W_O$ parameters in the $6$-th, $12$-th, $18$-th, and $24$-th modules at each iteration ($W_K,W_V,W_Q,W_O\in\mathbb{R}^{4096\times4096}$). The white lines mark the indices at which the singular values fall to 2% of the maximum singular value.
  • ...and 3 more figures
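
The low-rank checks described in the Figure 1 and Figure 5 captions can be approximated with a short sketch: compute the singular values of a gradient matrix and count how many exceed 2% of the largest one. The function name and the synthetic matrix below are hypothetical stand-ins; the paper's figures use real LLM gradients, which are not reproduced here.

```python
import numpy as np


def rank_at_threshold(G: np.ndarray, ratio: float = 0.02) -> int:
    """Number of singular values of G strictly above ratio * (largest singular value)."""
    s = np.linalg.svd(G, compute_uv=False)   # singular values in descending order
    return int(np.sum(s > ratio * s[0]))


# Synthetic stand-in for a gradient: an exactly rank-4 matrix plus tiny noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 4))
B = rng.standard_normal((4, 48))
G = A @ B + 1e-6 * rng.standard_normal((64, 48))

effective_rank = rank_at_threshold(G)
print(effective_rank)
```

The same function could be applied to a $d_l\times T$ matrix whose columns are the normalized, flattened gradients of one layer across iterations, which is the temporal view shown in Figure 1(b).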

Theorems & Definitions (6)

  • Theorem 1: Expectation and Variance
  • Remark 1
  • Theorem 2: Convergence
  • Remark 2
  • Lemma 1
  • proof