KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning

Zhendong Mi, Qitao Tan, Xiaodong Yu, Zining Zhu, Geng Yuan, Shaoyi Huang

TL;DR

KerZOO tackles the memory bottleneck in fine-tuning large language models by addressing gradient estimation bias in zeroth-order (ZO) optimization. It introduces a kernel-function-informed framework that cancels the leading lower-order bias terms in ZO gradient estimates, backed by a theoretical convergence analysis under $L$-smooth and higher-order smoothness assumptions. Empirically, KerZOO converges faster than existing ZO baselines while reaching competitive accuracy across RoBERTa, OPT, and LLaMA, with substantial reductions in GPU hours and strong compatibility with LoRA. The approach offers a principled, memory-efficient pathway to high-quality fine-tuning of large NLP models and can extend to other PEFT-enabled or modular model adaptations.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across numerous NLP tasks. Nevertheless, conventional first-order fine-tuning techniques impose heavy memory demands, creating practical obstacles to real-world applications. Zeroth-order (ZO) optimization has recently emerged as a promising memory-efficient alternative: it circumvents the need for backpropagation by estimating gradients solely through forward passes, which makes it particularly suitable for resource-limited environments. Despite its efficiency, ZO optimization suffers from gradient estimation bias, which significantly slows convergence. To address this, we analytically identify and characterize the lower-order bias introduced during ZO-based gradient estimation in LLM fine-tuning. Motivated by tools in mathematical physics, we introduce a kernel-function-based ZO framework aimed at mitigating this bias and improving optimization stability. KerZOO achieves comparable or superior performance to existing ZO baselines in both full-parameter and parameter-efficient fine-tuning settings of LLMs, while significantly reducing the number of iterations required to reach convergence. For example, KerZOO reduces total GPU training hours by as much as 74% and 44% on the WSC and MultiRC datasets, respectively, when fine-tuning the OPT-2.7B model, and can exceed the MeZO baseline by 2.9% and 2.6% in accuracy. We show that the kernel function is an effective avenue for reducing estimation bias in ZO methods.
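
To make "estimating gradients solely through forward passes" concrete, here is a minimal sketch of a generic two-point ZO gradient estimator in NumPy. It is not KerZOO's kernel-informed estimator, whose details are not given in this summary. The names `zo_gradient_estimate` and `quadratic_loss`, the Gaussian perturbation, and the step size `eps` are illustrative assumptions.

```python
import numpy as np


def zo_gradient_estimate(loss_fn, params, eps=1e-3, seed=0):
    """Generic two-point ZO estimate (illustrative; not the KerZOO estimator).

    Only two evaluations of loss_fn are used; no backpropagation is needed.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)    # random perturbation direction
    loss_plus = loss_fn(params + eps * z)    # forward pass at theta + eps * z
    loss_minus = loss_fn(params - eps * z)   # forward pass at theta - eps * z
    # Central difference approximates the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    return projected_grad * z


def quadratic_loss(w):
    # Hypothetical toy objective whose true gradient is simply w.
    return 0.5 * float(np.sum(w ** 2))


theta = np.array([1.0, -2.0, 0.5])
# Averaging many single-direction estimates approaches the true gradient.
estimates = [zo_gradient_estimate(quadratic_loss, theta, seed=s) for s in range(20000)]
avg_estimate = np.mean(estimates, axis=0)
print(avg_estimate)  # close to theta, up to sampling noise
```

On this quadratic toy loss the central difference is exact, so the only error is sampling noise. For a general smooth loss, the finite step `eps` also contributes higher-order Taylor terms to the estimate. The abstract's claim is that KerZOO reduces estimation bias in this setting through a kernel function.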

Paper Structure

This paper contains 35 sections, 46 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Training loss comparison of MeZO and KerZOO (Ours) on RoBERTa-large
  • Figure 2: Comparison of GPU hours for convergence across different datasets on OPT-2.7B between MeZO and KerZOO. Results are presented as normalized time.
  • Figure 3: Results on LLaMA3-3B and LLaMA3-8B
  • Figure 4: Training loss curves of MeZO with different perturbations and KerZOO