Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs

Kairun Zhang, Haoyu Li, Yanjun Zhao, Yifan Sun, Huan Zhang

TL;DR

This paper tackles the memory demands of fine-tuning large language models with a learning-to-learn (L2L) zeroth-order optimizer. It introduces ZO Fine-tuner, which learns adaptive, per-block perturbation variances to guide gradient-free updates, leveraging the block-diagonal Hessian structure of LLMs. The optimizer is trained once on a base model and transfers to that model's derivatives and to diverse downstream tasks, outperforming prior zeroth-order baselines in 82.1% of 28 task-model pairs with an average accuracy improvement of 2.5% and minimal overhead. This work offers a practical path toward memory-efficient fine-tuning at foundation-model scale by combining L2L with compact perturbation learning.

Abstract

Zeroth-order optimizers have recently emerged as a practical approach for fine-tuning large language models (LLMs), significantly reducing GPU memory consumption compared to traditional first-order methods. Yet, existing zeroth-order methods rely on hand-crafted, static sampling strategies that are not adaptable to model-specific structures. To address this, we propose ZO Fine-tuner, a learning-based zeroth-order optimizer for LLMs that automatically learns efficient perturbation strategies through a compact and memory-efficient design. Crucially, our approach is motivated by the observation that only a small number of foundation models and their derivatives are widely adopted in practice. Therefore, learning the optimizer once for a given LLM and reusing it across diverse downstream tasks is both feasible and highly desirable. Accordingly, ZO Fine-tuner is designed to scale learning to learn (L2L) to the foundation-model era by supporting one-time training per LLM with minimal overhead. Experiments on 4 LLMs and 7 datasets show that ZO Fine-tuner outperforms prior zeroth-order baselines in 82.1% of task-model combinations, thereby demonstrating strong performance and scalability for efficient LLM fine-tuning. Our code is available at https://github.com/ASTRAL-Group/ZO_Fine_tuner.git.
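
To make the idea of a zeroth-order update with per-block perturbation scales concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: it assumes a standard two-point (SPSA-style) estimator, treats each `sigma` as a per-block standard deviation, and uses a toy quadratic loss whose Hessian is block-diagonal. The names `zo_step` and `quadratic_loss` are hypothetical, and in ZO Fine-tuner the scales would come from the learned optimizer rather than being fixed by hand.

```python
import random

def zo_step(blocks, sigmas, loss_fn, lr=0.01, eps=1e-3, rng=random):
    """One two-point zeroth-order step with a separate perturbation scale per block.

    blocks: parameters grouped by block (list of lists of floats)
    sigmas: one perturbation standard deviation per block (assumed given)
    """
    # Sample z_i ~ N(0, sigma_i^2 I) independently for each block.
    z = [[rng.gauss(0.0, s) for _ in b] for b, s in zip(blocks, sigmas)]
    plus = [[p + eps * d for p, d in zip(b, zb)] for b, zb in zip(blocks, z)]
    minus = [[p - eps * d for p, d in zip(b, zb)] for b, zb in zip(blocks, z)]
    loss_plus, loss_minus = loss_fn(plus), loss_fn(minus)
    # Finite-difference estimate of the directional derivative along z;
    # only two forward passes are needed, no backpropagation.
    g = (loss_plus - loss_minus) / (2 * eps)
    new_blocks = [[p - lr * g * d for p, d in zip(b, zb)]
                  for b, zb in zip(blocks, z)]
    return new_blocks, loss_plus, loss_minus

def quadratic_loss(blocks):
    # Toy loss with block-diagonal Hessian: curvature 1 in block 0, 10 in block 1.
    curvatures = [1.0, 10.0]
    return sum(0.5 * c * p * p for c, b in zip(curvatures, blocks) for p in b)

rng = random.Random(0)
params = [[1.0, -2.0, 0.5], [0.3, -0.1, 0.2]]
print("initial loss:", quadratic_loss(params))
for _ in range(500):
    # Smaller perturbations for the high-curvature block (hand-picked here).
    params, _, _ = zo_step(params, [1.0, 0.3], quadratic_loss, rng=rng)
print("final loss:", quadratic_loss(params))
```

Giving the high-curvature block a smaller scale keeps the step stable without shrinking the perturbation for every parameter, which is the intuition behind learning non-uniform scales instead of using one static sampling distribution.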

Paper Structure

This paper contains 24 sections, 3 theorems, 13 equations, 6 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Define the expected change in loss after performing a one-step update of the parameter $\theta_t$ as $d(\theta_t)\!:=\!\mathbb{E}\left[\mathcal{L}(\theta_{t+1})\mid\theta_t\right]\!-\!\mathcal{L}(\theta_t)$. Suppose now the Hessian matrix $H(\theta_t)$ is block-diagonal $H(\theta_t)\!:=\!\operatorname{d

Figures (6)

  • Figure 1: Fine-tuning the LLM with a trained ZO Fine-tuner. Each block of the LLM is equipped with a lightweight neural network that predicts its perturbation variance. For the LLM parameters $\theta_t^i$ in block $i$ at step $t$, $\mathrm{PertNN}_i$ takes in compact summary statistics, namely $\operatorname{Mean}_t^i$ and $\operatorname{Var}_t^i$ of $\theta_t^i$. It also takes in the previous perturbation variance $\sigma_{t-1}^i$ and the two losses recorded at the previous step. It outputs the updated perturbation variance $\sigma_t^i$, which is then normalized. By learning non-uniform, layer-specific perturbation scales and plugging them into standard zeroth-order updates, the fine-tuner enables efficient, high-performance gradient-free optimization of the LLM.
  • Figure 2: Loss comparison across different methods on various datasets and LLMs. Models (columns) are LLaMA-3.2-1B, LLaMA-3.1-8B, Qwen2.5-14B and OPT-30B, while datasets (rows) cover COPA, SST-2, CB, SQuAD, WSC, BoolQ and DROP. All curves use the best hyperparameters found for each method. The shaded region around each curve shows the standard deviation of the smoothed loss; the wider the shade, the larger the fluctuation. ZO Fine-tuner shows advantages in both convergence speed and final loss value across most settings.
  • Figure 3: Loss curves under varying learning rates for different optimizers on (top) SST-2 with LLaMA-3.1-8B, and (bottom) SQuAD with Qwen2.5-14B.
  • Figure 4: Loss curves under varying learning rates for different optimizers with LLaMA-1B (top) and Qwen-14B (bottom). We report results on SST-2, COPA, and SQuAD. For MeZO-Adam, note that the actual learning rates used were $10^{-4}$, $10^{-5}$, and $10^{-6}$, corresponding to the plotted values of $10^{-6}$, $10^{-7}$, and $10^{-8}$, respectively.
  • Figure 5: Loss curves under varying learning rates for different optimizers with Qwen-14B (top) and OPT-30B (bottom). We report results on SST-2, COPA, and SQuAD.
  • ...and 1 more figure
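
The Figure 1 caption lists the inputs and output of each per-block network, and that data flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the caption does not specify the network architecture or the form of normalization, so the one-hidden-layer network `tiny_pert_net`, its weights, and the rescale-to-mean-1 rule in `normalize` are all hypothetical. Only the five input features come from the caption.

```python
import math
import statistics

def block_features(theta_block, prev_sigma, loss_a, loss_b):
    """Inputs named in the caption: mean and variance of the block's parameters,
    the previous perturbation scale, and the two losses from the last step."""
    return [
        statistics.fmean(theta_block),
        statistics.pvariance(theta_block),
        prev_sigma,
        loss_a,
        loss_b,
    ]

def tiny_pert_net(features, w_hidden, w_out):
    """Hypothetical one-hidden-layer network (architecture is an assumption).
    Softplus on the output keeps the predicted scale positive."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)))
              for row in w_hidden]
    raw = sum(w * h for w, h in zip(w_out, hidden))
    return math.log1p(math.exp(raw))

def normalize(sigmas):
    """Assumed normalization: rescale so the per-block scales average to 1."""
    mean = sum(sigmas) / len(sigmas)
    return [s / mean for s in sigmas]

# Two toy blocks, each with its own (made-up) network weights.
blocks = [[1.0, 2.0, 3.0], [0.1, -0.1, 0.0]]
weights = [
    ([[0.1] * 5, [-0.2] * 5], [0.5, 0.3]),
    ([[0.05] * 5, [0.3] * 5], [-0.4, 0.2]),
]
raw_sigmas = [
    tiny_pert_net(block_features(b, 1.0, 0.70, 0.65), w_h, w_o)
    for b, (w_h, w_o) in zip(blocks, weights)
]
sigmas = normalize(raw_sigmas)
```

Because each network sees only five scalars per block, its cost is independent of the block's parameter count, which is consistent with the caption's description of the networks as lightweight.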

Theorems & Definitions (5)

  • Theorem 1: Informal Version
  • Definition 1: Expected Loss Change
  • Theorem 2
  • Theorem 3
  • proof