Fine-Tuning Language Models with Just Forward Passes

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

TL;DR

MeZO is a memory-efficient zeroth-order optimizer that fine-tunes large language models with a memory footprint comparable to inference. By adapting SPSA gradient estimation into an in-place update, MeZO achieves substantial memory and GPU-hour savings while remaining effective with full-parameter tuning and PEFT, and it also supports non-differentiable objectives. The authors provide per-step and global convergence analyses showing that, under a local low-rank Hessian assumption, MeZO’s convergence rate can be dimension-free and therefore much faster than the parameter count alone would suggest. Empirically, MeZO outperforms zero-shot, ICL, and linear probing across model types and scales, approaches or matches standard fine-tuning on many tasks, and scales up to 66B parameters.

Abstract

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
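
The in-place, two-forward-pass update described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`perturb_`, `zo_step_`, `demo`), the toy quadratic loss, and the hyperparameter values are all assumptions chosen for illustration. The idea it tries to show is that the random direction `z` never needs to be stored alongside the parameters if it can be regenerated from a seed each time it is needed.

```python
import numpy as np


def perturb_(theta, scale, seed):
    """Add scale * z to theta in place, where z ~ N(0, I) is rebuilt from seed."""
    # Note: this toy version materializes z for the whole vector at once;
    # a memory-conscious version would regenerate it one tensor at a time.
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    theta += scale * z  # mutates the caller's array; z is discarded afterwards


def zo_step_(theta, loss_fn, lr, eps, seed):
    """One SPSA-based ZO-SGD step using two loss evaluations and no backprop."""
    perturb_(theta, eps, seed)            # theta + eps * z
    loss_plus = loss_fn(theta)
    perturb_(theta, -2.0 * eps, seed)     # theta - eps * z
    loss_minus = loss_fn(theta)
    perturb_(theta, eps, seed)            # back to the original theta
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)  # a scalar
    perturb_(theta, -lr * projected_grad, seed)  # theta -= lr * projected_grad * z
    return projected_grad


def demo(dim=20, steps=500, lr=0.01, eps=1e-3):
    """Minimize a toy quadratic (hypothetical stand-in for an LM loss)."""
    loss_fn = lambda t: 0.5 * float(t @ t)
    theta = np.ones(dim)
    initial = loss_fn(theta)
    for step in range(steps):
        zo_step_(theta, loss_fn, lr, eps, seed=step)  # fresh direction each step
    return initial, loss_fn(theta)
```

Because `loss_fn` is only ever evaluated, never differentiated, the same loop would accept any scalar objective, which is consistent with the abstract's point about non-differentiable metrics such as accuracy or F1.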

Paper Structure

This paper contains 56 sections, 11 theorems, 71 equations, 7 figures, 23 tables, 2 algorithms.

Key Result

Lemma 1

Let $\mathcal{L}(\bm{\theta})$ be $\ell$-smooth (this is satisfied for the standard cross-entropy objective). For any unbiased gradient estimate ${\bm{g}}({\bm{\theta}}, \mathcal{B})$,
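
The statement is cut off at this point in the listing. For reference, the textbook descent lemma for an SGD-style update $\bm{\theta}_{t+1} = \bm{\theta}_t - \eta\, {\bm{g}}({\bm{\theta}}_t, \mathcal{B})$ yields a bound of the following form; the learning rate $\eta$ and the step index $t$ are notation assumed here, and the paper's exact wording may differ:

$$
\mathbb{E}\left[\mathcal{L}(\bm{\theta}_{t+1}) \mid \bm{\theta}_t\right] - \mathcal{L}(\bm{\theta}_t) \le -\eta \left\|\nabla \mathcal{L}(\bm{\theta}_t)\right\|^2 + \frac{1}{2}\eta^2 \ell \cdot \mathbb{E}\left[\left\|{\bm{g}}({\bm{\theta}}_t, \mathcal{B})\right\|^2\right]
$$

It follows from applying $\ell$-smoothness, $\mathcal{L}(\bm{\theta}_{t+1}) \le \mathcal{L}(\bm{\theta}_t) + \nabla \mathcal{L}(\bm{\theta}_t)^\top (\bm{\theta}_{t+1} - \bm{\theta}_t) + \frac{\ell}{2}\|\bm{\theta}_{t+1} - \bm{\theta}_t\|^2$, substituting the update, and taking the expectation with $\mathbb{E}[{\bm{g}}] = \nabla \mathcal{L}(\bm{\theta}_t)$. The first term is the expected decrease; the second is the price paid for the norm (and hence the variance) of the gradient estimate.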

Figures (7)

  • Figure 1: OPT-13B results with zero-shot, in-context learning (ICL), MeZO (we report the best among MeZO/MeZO (LoRA)/MeZO (prefix)), and fine-tuning with Adam (FT). MeZO demonstrates superior results over zero-shot and ICL and performs on par with FT (within 1%) on 7 out of 11 tasks, despite using only 1/12 of the memory. See Table \ref{tab:opt} for detailed numbers and Figure \ref{fig:memory_fig} for memory profiling.
  • Figure 2: Experiments on RoBERTa-large. We report zero-shot, linear probing (LP), and MeZO and fine-tuning (FT) with full parameter, LoRA, and prefix-tuning. MeZO outperforms zero-shot and LP and approaches FT (within 5% for $k=512$) with much less memory. Detailed numbers in Table \ref{tab:roberta}.
  • Figure 3: GPU memory consumption with different OPT models and tuning methods on MultiRC (400 tokens per example on average).
  • Figure 4: Largest OPT models that one can tune with specific hardware and algorithms. $\dagger$: projected results without actual testing.
  • Figure 5: MeZO does not optimize significantly faster when tuning fewer parameters, agreeing with our theory in Section \ref{sec:theory}.
  • ...and 2 more figures

Theorems & Definitions (25)

  • Definition 1: Simultaneous Perturbation Stochastic Approximation (SPSA; Spall, 1992)
  • Definition 2: ZO-SGD
  • Definition 3: Unbiased Gradient Estimate
  • Lemma 1: Descent Lemma
  • Lemma 2
  • Theorem 1: Dimension-Free Rate
  • Corollary 1
  • Definition 4: PL Inequality
  • Definition 5: Gradient Covariance
  • Lemma 3: Global Convergence of ZO-SGD
  • ...and 15 more
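
The listing gives only the titles of these definitions. As a rough orientation, the first two and the PL inequality are commonly written as below; the symbols (perturbation scale $\epsilon$, Gaussian direction $\bm{z}$, learning rate $\eta$, minibatch $\mathcal{B}$, PL constant $\mu$, optimal value $\mathcal{L}^*$) are standard notation assumed here and may not match the paper exactly.

$$
\widehat{\nabla} \mathcal{L}(\bm{\theta}; \mathcal{B}) = \frac{\mathcal{L}(\bm{\theta} + \epsilon \bm{z}; \mathcal{B}) - \mathcal{L}(\bm{\theta} - \epsilon \bm{z}; \mathcal{B})}{2\epsilon}\, \bm{z}, \qquad \bm{z} \sim \mathcal{N}(0, \bm{I}_d) \qquad \text{(SPSA)}
$$

$$
\bm{\theta}_{t+1} = \bm{\theta}_t - \eta\, \widehat{\nabla} \mathcal{L}(\bm{\theta}_t; \mathcal{B}_t) \qquad \text{(ZO-SGD)}
$$

$$
\frac{1}{2} \left\|\nabla \mathcal{L}(\bm{\theta})\right\|^2 \ge \mu \left(\mathcal{L}(\bm{\theta}) - \mathcal{L}^*\right) \qquad \text{(PL inequality)}
$$

The SPSA estimate needs only two evaluations of $\mathcal{L}$ per step, which is why the method can run with forward passes alone.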