ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang

TL;DR

ZO2 tackles the memory bottleneck in fine-tuning extremely large language models by offloading model parameters to the CPU and employing zeroth-order optimization with dual forward passes. The framework introduces RNG state synchronization, a dynamic overlap scheduler, reusable GPU memory, efficient update strategies, and AMP-based low-bit compression to achieve substantial memory reductions (e.g., fine-tuning OPT-175B with 18GB of GPU memory) with negligible time overhead and no loss in accuracy. Experiments on the OPT family show these memory savings alongside near-baseline throughput and identical accuracy on multiple benchmarks; the RNG-management mechanism ensures that the perturbation applied in the forward passes is the same one used in the parameter update. The work makes fine-tuning of extremely large models accessible on commodity hardware and provides a public codebase to foster further research and practical deployment.

Abstract

Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties because the memory needed to store activations and gradients during the forward and backward phases grows with model size. Alternatively, zeroth-order (ZO) techniques can estimate gradients using only forward passes, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it is feasible to extend both the memory and the processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. Integrating the parameter updates with ZO's dual forward passes reduces unnecessary data movement, improving fine-tuning efficiency. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as OPT-175B with more than 175 billion parameters, on a mere 18GB GPU, an achievement beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2's code has been open-sourced at https://github.com/liangyuwang/zo2.
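To make the dual-forward idea concrete, here is a minimal sketch of a single zeroth-order SGD step in the style of MeZO, using only the Python standard library. It is not taken from the ZO2 codebase: the names `perturb`, `zo_step`, and `quadratic_loss`, the toy loss, and the hyperparameters are all assumptions for illustration. The perturbation vector is never stored; it is regenerated from a seed each time it is needed, which is the reason the forward passes and the update must see the same RNG state.

```python
import random

def perturb(params, seed, scale):
    """Add scale * z to params in place, regenerating z from the seed."""
    rng = random.Random(seed)  # same seed -> same Gaussian vector z
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)

def zo_step(params, loss_fn, seed, lr=0.01, eps=1e-3):
    """One zeroth-order step: two forward passes, no backward pass."""
    perturb(params, seed, +eps)          # theta + eps * z
    loss_plus = loss_fn(params)          # first forward pass
    perturb(params, seed, -2 * eps)      # theta - eps * z
    loss_minus = loss_fn(params)         # second forward pass
    perturb(params, seed, +eps)          # restore theta
    # Scalar "projected gradient": directional derivative estimate along z.
    g = (loss_plus - loss_minus) / (2 * eps)
    perturb(params, seed, -lr * g)       # theta <- theta - lr * g * z
    return g

def quadratic_loss(params):
    """Toy stand-in for a model's loss (hypothetical)."""
    return sum(p * p for p in params)

params = [1.0, -2.0, 0.5]
initial = quadratic_loss(params)
for step in range(200):
    zo_step(params, quadratic_loss, seed=step)
final = quadratic_loss(params)
```

Only the scalar `g` and the seed need to be kept between the forward passes and the update, so no activations or per-parameter gradients are held in memory. In a real offloading setup the perturbation would come from the GPU's random number generator rather than `random.Random`.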

Paper Structure

This paper contains 22 sections, 3 equations, 7 figures, 7 tables, 3 algorithms.

Figures (7)

  • Figure 1: Single-GPU memory usage comparison for training LLMs across different optimizers (AdamW, SGD, MeZO, and ZO2 (Zeroth-Order Offloading)) and model sizes (OPT-6.7B, OPT-13B, OPT-30B, OPT-175B). An 'X' indicates that training was not feasible due to excessive memory demand.
  • Figure 2: Workflow of the ZO2 framework for fine-tuning LLMs.
  • Figure 3: Motivation. (a) First-Order Optimizer: Employs a forward-backward pass sequence, where input $X$ undergoes multiple linear transformations (Linear 1, 2, 3), producing activations ($X_1, X_2$) and final output $Y$. The backward pass calculates gradients ($dW_1, dW_2, dW_3$), with parameters $W$ reloaded from CPU to GPU for gradient descent, leading to dual transfers and high GPU memory usage for activations. (b) Zeroth-Order Optimizer: Uses dual forward passes with perturbed weights ($W_1', W_2', W_3'$), generating outputs ($X', X_1', X_2'$) and $Y'$ to compute the two losses and approximate the gradient. This approach avoids storing activations and reduces GPU-CPU data transfers by requiring only a single parameter transmission after the last computation, optimizing resource use for large models on limited hardware.
  • Figure 4: Sequential task execution in the naive ZO2 framework, depicting non-overlapping dual forward passes and the associated inefficiencies.
  • Figure 5: Comparison of model parameter updates without and with the efficient update strategy. (a) At the $j$-th iteration, the model computes the projected gradient $g_j$ using the dual-forward method and then updates the model parameters. (b) At the $j$-th iteration, the model first updates the parameters using the previously saved projected gradient $g_{j-1}$, and then performs the dual-forward pass to compute the new projected gradient $g_j$.
  • ...and 2 more figures
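
The two update schedules described in the Figure 5 caption can be compared with a small sketch. It is an illustration under stated assumptions, not the ZO2 implementation: the helper names (`add_scaled_noise`, `projected_gradient`, `train_immediate`, `train_deferred`), the toy loss, and the hyperparameters are hypothetical. The sketch suggests that deferring the update for $g_{j-1}$ to the start of iteration $j$ changes only when the update runs, not which parameters result, provided the last pending update is applied at the end.

```python
import random

def add_scaled_noise(params, seed, scale):
    """Add scale * z to params in place; z is regenerated from the seed."""
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)

def projected_gradient(params, loss_fn, seed, eps=1e-3):
    """Dual forward pass: returns the scalar projected gradient g."""
    add_scaled_noise(params, seed, +eps)
    loss_plus = loss_fn(params)
    add_scaled_noise(params, seed, -2 * eps)
    loss_minus = loss_fn(params)
    add_scaled_noise(params, seed, +eps)  # restore the parameters
    return (loss_plus - loss_minus) / (2 * eps)

def loss(params):
    """Toy stand-in for a model's loss (hypothetical)."""
    return sum((p - 1.0) ** 2 for p in params)

def train_immediate(params, steps, lr=0.01):
    """Schedule (a): compute g_j, then update right away."""
    for j in range(steps):
        g = projected_gradient(params, loss, seed=j)
        add_scaled_noise(params, j, -lr * g)
    return params

def train_deferred(params, steps, lr=0.01):
    """Schedule (b): apply the saved g_{j-1} first, then compute g_j."""
    pending = None  # (seed, g) saved from the previous iteration
    for j in range(steps):
        if pending is not None:
            prev_seed, prev_g = pending
            add_scaled_noise(params, prev_seed, -lr * prev_g)
        pending = (j, projected_gradient(params, loss, seed=j))
    if pending is not None:  # flush the last saved gradient
        prev_seed, prev_g = pending
        add_scaled_noise(params, prev_seed, -lr * prev_g)
    return params

start = [0.0, 2.0, -1.0, 3.0]
immediate = train_immediate(list(start), steps=50)
deferred = train_deferred(list(start), steps=50)
max_gap = max(abs(a - b) for a, b in zip(immediate, deferred))
```

Because the deferred schedule applies each update just before the next dual forward pass, the update can happen while a parameter block is already resident on the GPU for that pass. This is one plausible reading of how integrating the updates with the dual forward passes reduces data movement. Only the seed and the scalar `g` have to be carried across iterations.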