Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models

Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding

TL;DR

LeZO is a novel layer-wise sparse, computation- and memory-efficient ZO optimizer. It incorporates layer-wise parameter sparsity into simultaneous perturbation stochastic approximation and ZO stochastic gradient descent, and accelerates training without compromising the performance of ZO optimization.

Abstract

Fine-tuning is a powerful way to adapt large language models to downstream tasks, but it often incurs high memory usage. A promising way to mitigate this is Zeroth-Order (ZO) optimization, which estimates gradients instead of computing First-Order (FO) gradients, albeit at the cost of longer training time due to its stochastic nature. Revisiting the Memory-efficient ZO (MeZO) optimizer, we find that the full-parameter perturbation and updating processes consume over 50% of its overall fine-tuning time. Based on this observation, we introduce a novel layer-wise sparse, computation- and memory-efficient ZO optimizer, named LeZO. LeZO treats layers as the fundamental units for sparsification and dynamically perturbs a different parameter subset in each step, so that full-parameter fine-tuning is still achieved over the course of training. LeZO incorporates layer-wise parameter sparsity into simultaneous perturbation stochastic approximation (SPSA) and ZO stochastic gradient descent (ZO-SGD). It accelerates the perturbation and updating processes without additional memory overhead. We conduct extensive experiments with the OPT model family on the SuperGLUE benchmark and two generative tasks. The experiments show that LeZO accelerates training without compromising the performance of ZO optimization. Specifically, it achieves over 3x speedup compared to MeZO on the SST-2, BoolQ, and COPA tasks.
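
The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `quad_loss`, `perturb`, `lezo_step`, and `n_active` are hypothetical, the toy quadratic loss stands in for a model forward pass, and choosing the active layers uniformly at random in each step is an assumption. Regenerating the noise from a seed instead of storing it follows the in-place perturbation idea used by MeZO.

```python
import numpy as np


def quad_loss(layers):
    """Toy stand-in for a forward pass: sum of squared parameters."""
    return float(sum(np.sum(w ** 2) for w in layers))


def perturb(layers, active, seed, scale):
    """Add scale * z in place to the active layers only.

    z is regenerated from `seed` on every call, so it is never stored.
    """
    rng = np.random.default_rng(seed)
    for i in active:
        layers[i] += scale * rng.standard_normal(layers[i].shape)


def lezo_step(layers, loss_fn, lr=1e-2, eps=1e-3, n_active=1, rng=None):
    """One layer-wise sparse ZO step (illustrative sketch, not the paper's code)."""
    if rng is None:
        rng = np.random.default_rng()
    # Assumption: the active layer subset is drawn uniformly at random each step.
    active = np.sort(rng.choice(len(layers), size=n_active, replace=False))
    seed = int(rng.integers(2**31))

    perturb(layers, active, seed, +eps)       # theta + eps * z'
    loss_plus = loss_fn(layers)
    perturb(layers, active, seed, -2 * eps)   # theta - eps * z'
    loss_minus = loss_fn(layers)
    perturb(layers, active, seed, +eps)       # restore theta (up to rounding)

    # Scalar projected gradient from the two forward passes (SPSA).
    g = (loss_plus - loss_minus) / (2 * eps)
    # ZO-SGD update, applied to the active layers only.
    perturb(layers, active, seed, -lr * g)
    return active, g


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = [np.ones(3) for _ in range(4)]
    print("initial loss:", quad_loss(layers))
    for _ in range(200):
        lezo_step(layers, quad_loss, n_active=2, rng=rng)
    print("final loss:  ", quad_loss(layers))
```

In this sketch, layers outside `active` are never touched in a step, which is where the saving in perturbation and update time would come from. Because the subset changes from step to step, every layer is still updated over the course of training.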

Paper Structure

This paper contains 15 sections, 3 theorems, 28 equations, 7 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

Let ${\mathcal{L}}_{{\mathcal{R}}}= \mathbb{E}_{{\mathcal{R}}} [ {\mathcal{L}}({\bm{\theta}} + \epsilon {\bm{z}}^{'})]$. The relationship between the model's sparse gradient $\widehat{\nabla}_{{\bm{\theta}}^{'}} {\mathcal{L}}_{{\mathcal{R}}}({\bm{\theta}})$ and the estimated ZO sparse gradient $\wid

Figures (7)

  • Figure 1: LeZO achieves a $3.4\times$ speedup for fine-tuning the OPT-13b model in run-time compared with MeZO on the SST-2 dataset.
  • Figure 2: Proportion of computational time cost for each operation in a single step when fine-tuning the OPT-13b model on the SST-2 dataset using MeZO.
  • Figure 3: Impact of learning rate and sparsity ratio on fine-tuning models using LeZO on the SST-2 task. Results with accuracy exceeding 90% are displayed. Experiments were conducted with a single random seed. "Dropout Number" indicates the number of sparse layers in the OPT-13b model. To enhance clarity, the learning rates are logarithmically scaled. The curves projected onto straight lines in the lower plane illustrate the range of learning rates that lead to improved performance after sparsifying different numbers of layers in the model. The red line shows the performance of MeZO.
  • Figure 4: Correlation between the sparsity ratio and runtime in fine-tuning the OPT-13b model using LeZO on the SST-2 task. This figure presents the optimal experimental results at various sparsity levels from Figure 3, annotated with different colored pentagrams. The purple line indicates the total time required for fine-tuning in the corresponding experiments.
  • Figure 5: Comparison in computation efficiency between LeZO and MeZO on various tasks.
  • ...and 2 more figures

Theorems & Definitions (11)

  • Definition 1: Simultaneous Perturbation Stochastic Approximation, SPSA (Spall, 1992)
  • Definition 2: Zeroth-Order Stochastic Gradient Descent, ZO-SGD (Malladi et al., 2023)
  • Definition 3: Layer-wise Sparse SPSA
  • Definition 4: LeZO-SGD
  • Remark 1
  • Lemma 1: Unbiased Estimation of Sparse Gradient
  • Lemma 2: Relationship between Sparse Gradient and Estimate Value
  • Lemma 3: Convergence of LeZO
  • Proof: Proof of Lemma 1
  • Proof: Proof of Lemma 2
  • ...and 1 more
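
For orientation, Definitions 1 and 2 are usually stated in the ZO literature in the form sketched below. The notation here ($\epsilon$ for the perturbation scale, $\eta$ for the learning rate, $\mathcal{B}$ for a minibatch, $d$ for the number of parameters) is an assumption and may differ from the paper's own statements.

$$\widehat{\nabla} {\mathcal{L}}({\bm{\theta}}; \mathcal{B}) = \frac{{\mathcal{L}}({\bm{\theta}} + \epsilon {\bm{z}}; \mathcal{B}) - {\mathcal{L}}({\bm{\theta}} - \epsilon {\bm{z}}; \mathcal{B})}{2\epsilon} \, {\bm{z}}, \qquad {\bm{z}} \sim \mathcal{N}(0, {\bm{I}}_d)$$

$${\bm{\theta}}_{t+1} = {\bm{\theta}}_t - \eta \, \widehat{\nabla} {\mathcal{L}}({\bm{\theta}}_t; \mathcal{B}_t)$$

Under the same assumptions, a layer-wise sparse variant would replace ${\bm{z}}$ with a perturbation ${\bm{z}}^{'}$ that equals ${\bm{z}}$ on the layers selected in the current step and is zero elsewhere, so that only the selected layers are perturbed and updated.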