Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer

Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor W. Tsang

TL;DR

This work proposes HiZOO, a diagonal Hessian informed zeroth-order optimizer, presented as the first to leverage the diagonal Hessian to enhance zeroth-order optimization for fine-tuning LLMs. Experiments indicate that HiZOO improves model convergence, significantly reducing the number of training steps and improving model accuracy.

Abstract

Fine-tuning large language models (LLMs) with classic first-order optimizers requires prohibitive GPU memory because of backpropagation. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using only two forward passes per step. However, these optimizers are hampered by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer, which is the first work to leverage the diagonal Hessian to enhance zeroth-order optimizers for fine-tuning LLMs. HiZOO also avoids the expensive memory cost and adds only one forward pass per step. Extensive experiments on various models (350M to 66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.
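
To make the "two forward passes" idea concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimate on a toy quadratic. It is not the authors' implementation: the function names (`zo_gradient`, `zo_sgd`, `quadratic`), the toy loss, and all hyperparameters are assumptions chosen for illustration, and the sketch omits the Hessian information that HiZOO adds.

```python
import random

def zo_gradient(loss, theta, mu=1e-3, seed=0):
    """Two-point gradient estimate from two loss evaluations (no backprop).

    Hypothetical helper for illustration, not taken from the HiZOO code.
    """
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in theta]           # random direction
    theta_plus = [t + mu * ui for t, ui in zip(theta, u)]
    theta_minus = [t - mu * ui for t, ui in zip(theta, u)]
    # Two "forward passes": the loss at the two perturbed points.
    coeff = (loss(theta_plus) - loss(theta_minus)) / (2.0 * mu)
    return [coeff * ui for ui in u]

def zo_sgd(loss, theta, lr=0.01, steps=2000, mu=1e-3):
    """Plain zeroth-order SGD; the step index seeds each perturbation."""
    for step in range(steps):
        g = zo_gradient(loss, theta, mu=mu, seed=step)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def quadratic(theta):
    """Toy stand-in for a model loss (assumption of this sketch)."""
    return sum(t * t for t in theta)

theta0 = [1.0, -2.0, 0.5]
theta_final = zo_sgd(quadratic, theta0)
print(quadratic(theta0), quadratic(theta_final))
```

Only loss values are ever evaluated, which is why no activations or gradients need to be stored for a backward pass.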

Paper Structure (33 sections, 3 theorems, 30 equations, 17 figures, 11 tables, 3 algorithms)

Key Result

Theorem 3.2

Let the descent direction $g_\mu(\theta_t)$ be defined as: [...] Under Assumptions ass:L through ass:beta, if the update rule for $\theta$ is $\theta_{t+1} = \theta_t - \eta g_\mu(\theta_t)$ for a single step, then it holds that: [...] Furthermore, given the number of iterations $T$, we choose the step size $\eta = \frac{1}{8\sqrt{T}L(\max_t\mathrm{tr}(\Sigma_t) +\beta_u)}$ and take $\theta_{\text{out}} = \theta_j$ [...]
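
As a worked illustration of the step-size choice above, the following sketch evaluates $\eta = \frac{1}{8\sqrt{T}L(\max_t\mathrm{tr}(\Sigma_t)+\beta_u)}$ and applies one update $\theta_{t+1} = \theta_t - \eta g_\mu(\theta_t)$. The numeric values of $T$, $L$, $\max_t\mathrm{tr}(\Sigma_t)$, $\beta_u$ and the descent direction are made up for the example, and the helper names are hypothetical.

```python
import math

def theorem_step_size(T, L, max_trace_sigma, beta_u):
    """eta = 1 / (8 * sqrt(T) * L * (max_t tr(Sigma_t) + beta_u))."""
    return 1.0 / (8.0 * math.sqrt(T) * L * (max_trace_sigma + beta_u))

def apply_update(theta, g_mu, eta):
    """One step of theta_{t+1} = theta_t - eta * g_mu(theta_t)."""
    return [t - eta * g for t, g in zip(theta, g_mu)]

# Illustrative constants only; none of these values come from the paper.
eta = theorem_step_size(T=100, L=1.0, max_trace_sigma=3.0, beta_u=1.0)
theta_next = apply_update([1.0, -2.0], [0.5, -0.5], eta)
print(eta, theta_next)
```

With these values, $\eta = 1/(8 \cdot 10 \cdot 1 \cdot 4) = 1/320$. The formula shows that the step size shrinks as $1/\sqrt{T}$ and as the trace of $\Sigma_t$ grows.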

Figures (17)

  • Figure 1: (Left) Comparison of HiZOO, MeZO and Adam. (Right) Heterogeneous curvatures example. HiZOO updates along the direction with greater curvature ($X$) and converges more quickly than MeZO. The corresponding loss curves are shown in Section \ref{2D_curvature}.
  • Figure 2: Performance of MeZO, HiZOO and HiZOO-L on the SST2 task when fine-tuning the RoBERTa-large, OPT-13B and Llama3 (8B) models. HiZOO can achieve an 8$\times$ speedup and a 1.55% absolute accuracy improvement compared with MeZO.
  • Figure 3: Optimization trajectories of Adam, MeZO and HiZOO on 3 test functions. Each trajectory is labeled with the number of iterations required for the loss to drop to 0.1.
  • Figure 4: Training loss curves when using Adam, MeZO and HiZOO to fine-tune RoBERTa-large on MNLI. The evaluation accuracy curves can be found in Figure \ref{fig:app_roberta_eval_acc} in Appendix \ref{appendix_robert}.
  • Figure 5: GPU memory consumption with different OPT models and tuning methods on MultiRC (400 tokens per example on average). More details can be found in Appendix \ref{app:memory_time}.
  • ...and 12 more figures
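
The heterogeneous-curvature setting described in the captions for Figures 1 and 3 can be imitated with a small experiment. The sketch below compares a plain two-point zeroth-order step with one whose perturbation is rescaled by the inverse square root of the diagonal Hessian. It is only an illustration under strong assumptions: the 2D quadratic, the curvature values, and the step sizes are invented here, and the exact diagonal Hessian is plugged in directly, whereas HiZOO estimates it from forward passes.

```python
import random

A, B = 100.0, 1.0  # curvatures along x and y (illustrative values)

def loss(theta):
    x, y = theta
    return 0.5 * (A * x * x + B * y * y)

def zo_step(theta, lr, mu, rng, scale):
    """Two-point zeroth-order step with a per-coordinate perturbation scale."""
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    z = [s * ui for s, ui in zip(scale, u)]            # scaled direction
    plus = [t + mu * zi for t, zi in zip(theta, z)]
    minus = [t - mu * zi for t, zi in zip(theta, z)]
    coeff = (loss(plus) - loss(minus)) / (2.0 * mu)
    return [t - lr * coeff * zi for t, zi in zip(theta, z)]

def run(lr, scale, steps=500, seed=0):
    rng = random.Random(seed)
    theta = [1.0, 1.0]
    for _ in range(steps):
        theta = zo_step(theta, lr, 1e-3, rng, scale)
    return loss(theta)

# Plain step: identity scaling; the step size is limited by the sharp x-axis.
loss_plain = run(lr=0.002, scale=[1.0, 1.0])
# Curvature-aware step: scale = (exact diagonal Hessian)^(-1/2), an assumption
# of this sketch that replaces HiZOO's estimated Hessian.
loss_scaled = run(lr=0.1, scale=[A ** -0.5, B ** -0.5])
print(loss_plain, loss_scaled)
```

The rescaling equalizes the curvature seen by the optimizer, so a single larger step size works for both coordinates; without it, the step size must be small enough for the sharp direction and progress along the flat direction is slow.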

Theorems & Definitions (7)

  • Definition 3.1
  • Theorem 3.2
  • proof
  • proof
  • Lemma B.4
  • proof
  • Lemma B.5