Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models

Zeman Li; Xinwei Zhang; Peilin Zhong; Yuan Deng; Meisam Razaviyayn; Vahab Mirrokni

TL;DR

Addax addresses the memory bottlenecks of fine-tuning large language models with Adam by mixing zeroth- and first-order gradient updates in a data-aware, in-place framework. By partitioning data by sequence length at each minibatch and applying zeroth-order updates to long sequences while using first-order updates on shorter ones, Addax achieves memory comparable to memory-efficient baselines like MeZO but with faster convergence and higher final accuracy. The paper provides nonconvex and strongly convex convergence analyses, showing near-dimension-free rates and looser hyperparameter requirements than MeZO, and validates the approach across multiple models (OPT, Llama, RoBERTa) and tasks, reporting substantial accuracy gains and speedups with manageable memory. Overall, Addax offers a practical, scalable route to high-quality fine-tuning on large models under limited compute resources.

Abstract

Fine-tuning language models (LMs) with the Adam optimizer often demands excessive memory, limiting accessibility. The "in-place" version of Stochastic Gradient Descent (IP-SGD) and the Memory-Efficient Zeroth-order Optimizer (MeZO) have been proposed to address this. However, IP-SGD still requires substantial memory, and MeZO suffers from slow convergence and degraded final performance due to its zeroth-order nature. This paper introduces Addax, a novel method that improves both the memory efficiency and the performance of IP-SGD by integrating it with MeZO. Specifically, Addax computes the zeroth- or first-order gradient of each data point in the minibatch based on its memory consumption, and combines these gradient estimates into a single update direction. By computing zeroth-order gradients for data points that require more memory and first-order gradients for the others, Addax overcomes the slow convergence of MeZO and the excessive memory requirement of IP-SGD. Additionally, the zeroth-order gradient acts as a regularizer for the first-order gradient, further enhancing the model's final performance. Theoretically, we establish the convergence of Addax under mild assumptions, demonstrating faster convergence and less restrictive hyper-parameter choices than MeZO. Our experiments with diverse LMs and tasks show that Addax consistently outperforms MeZO in accuracy and convergence speed while having a comparable memory footprint. When fine-tuning OPT-13B with one A100 GPU, on average, Addax outperforms MeZO in accuracy/F1 score by 14% and runs 15x faster while using memory similar to MeZO. In our experiments on the larger OPT-30B model, on average, Addax outperforms MeZO in accuracy/F1 score by more than 16% and runs 30x faster on a single H100 GPU. Moreover, Addax surpasses the performance of standard fine-tuning approaches, such as IP-SGD and Adam, in most tasks while requiring significantly less memory.
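The mixed update described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a linear least-squares model in place of a language model, uses a hypothetical `seq_len` field as the proxy for per-example memory cost, and weights the zeroth-order estimate by `alpha` and the first-order gradient by `1 - alpha`. The names `addax_style_step`, `max_fo_len`, and `run_demo`, the fallback behavior for an empty partition, and all numeric settings are illustrative assumptions.

```python
import random

def predict(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def mean_loss(theta, examples):
    """Mean squared-error loss of a linear model over (x, y) pairs."""
    return sum(0.5 * (predict(theta, x) - y) ** 2 for x, y in examples) / len(examples)

def first_order_grad(theta, examples):
    """Exact gradient of mean_loss; stands in for backpropagation."""
    grad = [0.0] * len(theta)
    for x, y in examples:
        err = predict(theta, x) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / len(examples)
    return grad

def zeroth_order_grad(theta, examples, eps, rng):
    """Two-point estimate along one Gaussian direction (loss evaluations only)."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    slope = (mean_loss(plus, examples) - mean_loss(minus, examples)) / (2.0 * eps)
    return [slope * zi for zi in z]

def addax_style_step(theta, batch, max_fo_len, alpha, lr, eps, rng):
    """One combined update; batch items are (x, y, seq_len) triples."""
    short = [(x, y) for x, y, n in batch if n <= max_fo_len]
    long_ = [(x, y) for x, y, n in batch if n > max_fo_len]
    if not long_:    # assumption: no long examples, fall back to a plain SGD step
        direction = first_order_grad(theta, short)
    elif not short:  # assumption: no short examples, fall back to a zeroth-order step
        direction = zeroth_order_grad(theta, long_, eps, rng)
    else:
        g1 = first_order_grad(theta, short)            # cheap-to-store examples
        g0 = zeroth_order_grad(theta, long_, eps, rng)  # memory-heavy examples
        direction = [alpha * a + (1.0 - alpha) * b for a, b in zip(g0, g1)]
    return [t - lr * d for t, d in zip(theta, direction)]

def run_demo(seed=0, steps=300):
    """Train on synthetic noiseless data; returns (initial_loss, final_loss)."""
    rng = random.Random(seed)
    true_theta = [1.0, -2.0, 0.5]
    data = []
    for _ in range(64):
        x = [rng.gauss(0.0, 1.0) for _ in true_theta]
        data.append((x, predict(true_theta, x), rng.randint(10, 500)))
    pairs = [(x, y) for x, y, _ in data]
    theta = [0.0, 0.0, 0.0]
    initial = mean_loss(theta, pairs)
    for _ in range(steps):
        batch = rng.sample(data, 8)
        theta = addax_style_step(theta, batch, max_fo_len=256, alpha=0.1,
                                 lr=0.05, eps=1e-3, rng=rng)
    return initial, mean_loss(theta, pairs)

if __name__ == "__main__":
    print(run_demo())
```

In this sketch the zeroth-order branch only evaluates the loss, so it never needs stored activations for the long examples, which is the memory argument the abstract makes; the first-order branch supplies the low-variance signal that speeds up convergence.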

Paper Structure (38 sections, 13 theorems, 53 equations, 11 figures, 15 tables, 3 algorithms)

Introduction
Notations and Preliminaries
Notations
Memory-Efficient Zeroth-order Optimizer
Preliminaries and Major Observations
Addax
Algorithm Overview
Theoretical Analysis
Further Discussions on the Benefits of Utilizing Zeroth-Order Gradients
Experiments
Conclusion, Broader Impact, and Limitations
More discussion on Addax
More discussion on the in-place updates
Discussion on related works
Experiment setup
...and 23 more sections

Key Result

Theorem 3.1

Assume that the loss $\ell$ is Lipschitz smooth, and the first-order stochastic gradient is unbiased and has bounded variance. Choosing $\eta_t = \eta = \mathcal{O}(d^{-1/2}T^{-1/2})$ and $\epsilon = \mathcal{O}(d^{-3/4}T^{-1/4})$ in Addax leads to the convergence rate: Further, by choosing the optimal $\alpha = \frac{K^0}{K^0+dK^1}$, we obtain the convergence rate $\mathcal{O}\left(\sqrt{\frac{
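To get a feel for the orders of magnitude in the statement above, the three hyperparameter choices can be evaluated numerically. This is a hedged sketch only: the theorem fixes these quantities up to constants, so the unit constants `c_eta` and `c_eps` and the function name `theorem_scalings` are assumptions made for illustration.

```python
import math

def theorem_scalings(d, T, K0, K1, c_eta=1.0, c_eps=1.0):
    """Evaluate the stated orders with assumed constants c_eta and c_eps."""
    eta = c_eta / math.sqrt(d * T)          # eta = O(d^{-1/2} T^{-1/2})
    eps = c_eps * d ** -0.75 * T ** -0.25   # eps = O(d^{-3/4} T^{-1/4})
    alpha = K0 / (K0 + d * K1)              # alpha = K^0 / (K^0 + d K^1)
    return eta, eps, alpha

if __name__ == "__main__":
    print(theorem_scalings(d=16, T=10_000, K0=4, K1=1))
```

With $d = 16$, $T = 10^4$, $K^0 = 4$, and $K^1 = 1$, this gives approximately $\eta = 0.0025$, $\epsilon = 0.0125$, and $\alpha = 0.2$. Because $d$ multiplies $K^1$ in the denominator, the stated $\alpha$ shrinks rapidly as the parameter count grows.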

Figures (11)

Figure 1: Accuracy/F-1 score, memory, and convergence time resulting from fine-tuning the OPT-13B model with various algorithms on one A100 (40GB) GPU, except for Adam, which runs on five GPUs. The label "OOM" means the run encounters an out-of-memory error during fine-tuning, even with the smallest batch size. Addax consistently outperforms other methods in terms of accuracy, with GPU memory consumption comparable to MeZO. Except for Adam, all methods run in 16-bit mode. We do not report the time for Adam as it requires five GPUs. The exact numbers can be found in Table \ref{tab:opt-13B-main_results} in Appendix \ref{app: OPT-13B-main-results}.
Figure 2: Accuracy/F-1 score, memory, and convergence time resulting from fine-tuning the OPT-30B model with various algorithms on one H100 (80GB) GPU. The label "OOM" means the run encounters out-of-memory errors during fine-tuning. Addax leads to the best final accuracy in all experiments and has a comparable memory footprint to MeZO while converging orders of magnitude faster. The exact numbers related to this figure can be found in Table \ref{tab:opt-30B-main_results} in Appendix \ref{app: opt-30B-main-results}.
Figure 3: Left: Memory profile of fine-tuning OPT-13B with IP-SGD and MeZO on a synthetic dataset with a fixed sequence length of $300$. Right: Fine-tuning OPT-13B using IP-SGD and small batch sizes (BS) can outperform Adam while consuming significantly lower memory.
Figure 4: Memory profiling of SGD, IP-SGD, and MeZO on OPT-13B fine-tuning with synthetic datasets with varying sequence lengths (fixing batch size = 8).
Figure 5: Left: An illustration of loss function $\mathcal{L}(\bm{\theta})$ alongside its Gaussian smoothed version $\widehat{\mathcal{L}}(\bm{\theta})$. Minimizing $(1-\alpha) \mathcal{L}(\boldsymbol{\theta})+\alpha \widehat{\mathcal{L}}(\boldsymbol{\theta})$ can help escape sharp local minima and find higher quality solutions. Right: The regularization effect of zeroth-order gradient estimates on first-order gradient estimates. We fix $K^1 = 4$ in Addax across experiments while varying $K^0$ from 0 to 16. In the special case where $K^0 = 0$, Addax reduces to IP-SGD.
...and 6 more figures

Theorems & Definitions (16)

Theorem 3.1: Informal
Remark 3.2
Remark 3.3
Remark 3.4
Theorem 3.5: Informal
Lemma G.5: gao2018information, Lemma 4.1 (b)
Lemma G.6: malladi2023MeZO, Lemma 2
Lemma G.7: zhang2023dpzero, Lemma C.1 (iv)
Theorem G.8
Corollary G.9
...and 6 more
