Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

Yihua Zhang; Pingzhi Li; Junyuan Hong; Jiaxiang Li; Yimeng Zhang; Wenqing Zheng; Pin-Yu Chen; Jason D. Lee; Wotao Yin; Mingyi Hong; Zhangyang Wang; Sijia Liu; Tianlong Chen

TL;DR

This work tackles the memory bottleneck of fine-tuning large language models (LLMs) by benchmarking zeroth-order (ZO) optimization methods as backpropagation-free (BP-free) alternatives to traditional first-order (FO) optimizers. It goes beyond prior work by evaluating multiple ZO techniques, task complexities, and parameter-efficient fine-tuning (PEFT) schemes across five LLM families, yielding insights about task alignment, the forward gradient method as a baseline, and the trade-off between algorithmic complexity and performance. The study confirms substantial memory savings with ZO methods, analyzes memory use under different numerical precisions, and proposes enhancements such as block-wise descent, hybrid ZO-FO training, and gradient sparsity to further improve efficiency. The released code supports reproducibility and offers a practical path toward memory-efficient LLM fine-tuning in resource-constrained settings.

Abstract

In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard. Yet, as LLMs grow in size, the substantial memory overhead from back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications like on-device training where memory efficiency is paramount. This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during LLM fine-tuning, building on the initial concept introduced by MeZO. Rather than considering only the traditional ZO-SGD method, our work expands the exploration to a wider array of ZO optimization techniques through a comprehensive, first-of-its-kind benchmarking study across five LLM families (RoBERTa, OPT, LLaMA, Vicuna, Mistral), three task complexities, and five fine-tuning schemes. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance. We further introduce novel enhancements to ZO optimization, including block-wise descent, hybrid training, and gradient sparsity. Our study offers a promising direction toward more memory-efficient LLM fine-tuning. Code to reproduce all our experiments is available at https://github.com/ZO-Bench/ZO-LLM .
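To make the BP-free idea concrete, the sketch below shows one common form of ZO-SGD: a two-point randomized gradient estimate built from forward evaluations only. It is an illustration under stated assumptions, not the benchmark's implementation. The function names (`zo_gradient_estimate`, `zo_sgd`, `toy_loss`), the Gaussian perturbation, the smoothing parameter `mu`, and the toy quadratic loss are all hypothetical choices made for this example.

```python
import numpy as np

def zo_gradient_estimate(loss_fn, params, mu=1e-3, rng=None):
    """Two-point randomized gradient estimate using only loss evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    # Random Gaussian direction (an assumption of this sketch).
    u = rng.standard_normal(params.shape)
    # Central finite difference of the loss along direction u.
    directional = (loss_fn(params + mu * u) - loss_fn(params - mu * u)) / (2.0 * mu)
    # Scale the direction by the estimated directional derivative.
    return directional * u

def zo_sgd(loss_fn, params, lr=0.01, steps=2000, mu=1e-3, seed=0):
    """Plain SGD driven by the ZO estimate; no back-propagation is involved."""
    rng = np.random.default_rng(seed)
    params = params.copy()
    for _ in range(steps):
        params -= lr * zo_gradient_estimate(loss_fn, params, mu, rng)
    return params

def toy_loss(w):
    """Hypothetical stand-in for a fine-tuning loss: a simple quadratic."""
    target = np.array([1.0, -2.0, 0.5])
    return float(np.sum((w - target) ** 2))

w0 = np.zeros(3)
w_final = zo_sgd(toy_loss, w0)
print(toy_loss(w0), toy_loss(w_final))
```

Each step needs two loss evaluations and no stored activations, which is where the memory saving relative to BP comes from. The price is a noisier gradient estimate whose variance grows with the number of parameters.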

Paper Structure (16 sections, 37 equations, 3 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Reviewing ZO Optimization and Beyond
Forward gradient: A missing BP-free baseline in LLM fine-tuning.
LLM Fine-Tuning Benchmarking
Benchmark Setups
Experiment Results
An In-Depth Dissection on Memory Efficiency
Extended Study to Improve ZO Fine-Tuning
Conclusion
Zeroth-Order Optimization Algorithms
Preliminaries of Parameter-Efficient Fine-Tuning (PEFT)
How to Implement Memory-Efficient ZO/FO Optimizers?
Theoretical Memory Efficiency Analysis of Different Optimizers
Other Implementation Details
...and 1 more sections

Figures (3)

Figure 1: Results of OPT-13B on the tasks COPA and WinoGrande fine-tuned using ZO/FO optimizers in different PEFT settings.
Figure 2: LoRA-based fine-tuning accuracy of OPT-1.3B on SST2 using ZO-SGD and Forward-Grad over different budgets.
Figure 3: Peak memory comparison of full fine-tuning with FO-SGD and ZO-SGD across various sequence lengths with a fixed effective batch size of $2$. Peak memory consumption was evaluated with the input of synthetic texts generated from random sequences of the specified lengths.
