Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning

Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, Geng Yuan

TL;DR

The paper addresses the memory bottleneck of first-order fine-tuning for large language models and the slower convergence of zeroth-order methods. It introduces DiZO, a divergence-driven zeroth-order optimization that uses anchor-based, learnable projections to enforce layer-wise, FO-like update magnitudes while preserving ZO forward-pass efficiency. Through a two-stage process and carefully stabilized projection learning, DiZO achieves faster convergence and higher accuracy across RoBERTa-large, OPT, and Llama models, with up to 48% reductions in GPU hours and substantial memory savings. The approach is shown to be compatible with LoRA PEFT and backed by a convergence analysis, extensive experiments, and ablations, highlighting its practical impact for memory-constrained deployment of large models.

Abstract

Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization has stood out as a promising memory-efficient training paradigm: it avoids backward passes and relies solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO methods lag far behind FO methods in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. Aiming to match the learning capacity of FO methods based on these findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections into ZO updates, generating updates of diverse magnitudes scaled to each layer's individual optimization needs. Our results demonstrate that DiZO significantly reduces the number of iterations needed for convergence without sacrificing throughput, cutting training GPU hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series models on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://github.com/Skilteee/DiZO.
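To make the forward-pass-only gradient estimation concrete, here is a minimal sketch of a generic two-point (SPSA-style) zeroth-order step on a toy quadratic loss, written in pure Python. It is not the released DiZO code: the function names (`loss`, `zo_sgd_step`, `run`), the toy parameters, and the fixed `layer_scales` are all assumptions made for illustration. In particular, `layer_scales` is only a hypothetical stand-in that shows what "different update magnitudes per layer" means; DiZO learns anchor-based projections rather than using fixed constants.

```python
import random


def loss(params):
    # Toy quadratic objective over a list of "layers" (lists of floats).
    return sum(w * w for layer in params for w in layer)


def zo_sgd_step(params, loss_fn, lr=0.05, eps=1e-3, layer_scales=None, rng=random):
    """One two-point zeroth-order step that uses only loss evaluations."""
    if layer_scales is None:
        layer_scales = [1.0] * len(params)
    # Sample one Gaussian perturbation direction per parameter.
    z = [[rng.gauss(0.0, 1.0) for _ in layer] for layer in params]
    plus = [[w + eps * d for w, d in zip(layer, zl)] for layer, zl in zip(params, z)]
    minus = [[w - eps * d for w, d in zip(layer, zl)] for layer, zl in zip(params, z)]
    # Scalar estimate of the directional derivative from two forward passes.
    g = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    # Move against the estimated gradient g * z; each layer gets its own
    # (hypothetical, fixed) scale factor on top of the shared learning rate.
    return [
        [w - lr * s * g * d for w, d in zip(layer, zl)]
        for layer, zl, s in zip(params, z, layer_scales)
    ]


def run(steps=500, seed=0):
    rng = random.Random(seed)
    params = [[1.0, -2.0], [0.5, 1.5, -1.0]]  # two toy "layers"
    start = loss(params)
    for _ in range(steps):
        params = zo_sgd_step(params, loss, layer_scales=[1.0, 0.5], rng=rng)
    return start, loss(params)
```

Calling `run()` returns the loss before and after training. No backward pass is involved: each step costs two evaluations of `loss`, which is the property that makes ZO methods memory-efficient.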

Paper Structure

This paper contains 48 sections, 1 theorem, 48 equations, 8 figures, 21 tables, 1 algorithm.

Key Result

Theorem 5.3

Under Assumptions \ref{assumption_smoothness}--\ref{assumption_projection}, suppose DiZO runs for $T$ iterations with step size $\eta = c/\sqrt{T}$ for a sufficiently small constant $c>0$. Then there exist constants such that

Figures (8)

  • Figure 1: Comparison of the training dynamics of ZO and FO methods. For the upper subfigure, $W_{K},W_{V},W_{Q},W_{O}$ indicate the corresponding weight matrix in the attention module.
  • Figure 2: Experiments on RoBERTa-large. DiZO outperforms the baselines with and without LoRA. Detailed numbers are presented in Table \ref{roberta-main}, and the loss trajectory is shown in Figure \ref{speed_roberta}.
  • Figure 3: Experiment results on Llama3-3B and Llama3-8B. More results and detailed numbers are shown in Appendix \ref{Llama}.
  • Figure 4: Experiment results on OPT-6.7B (with 1000 training samples).
  • Figure 5: Comparison between MeZO and DiZO in terms of convergence iterations, forward passes, and training GPU hours across multiple datasets. Results are presented as proportions, with the percentage of saved GPU hours highlighted for each dataset.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Theorem 5.3
  • proof