GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping

Yikang Yue, Yishu Yin, Xuehai Qian

TL;DR

SSD-offloaded training makes it possible to train LLMs that exceed GPU memory, but it is bottlenecked by optimizer I/O and data movement. GreedySnake introduces vertical gradient accumulation, a pipelined schedule that runs all micro-batches through a layer before moving to the next layer. This enables extensive overlap between the backward pass and the optimizer step, and partial overlap of the optimizer step with the next iteration's forward pass. An LP-based configuration search automates data placement and overlap ratios to maximize saturated throughput under memory and bandwidth constraints. On A100 GPUs, GreedySnake delivers up to ~2.5x the saturated throughput of ZeRO-Infinity on GPT-65B and GPT-175B workloads, and it remains efficient with small micro-batches and smaller overall batch sizes. The code is open-sourced.

Abstract

SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all micro-batches of a layer before proceeding to the next layer. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially, each through all layers), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake improves saturated training throughput over ZeRO-Infinity by 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and by 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake
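
The difference between horizontal and vertical scheduling can be illustrated with a toy model of the forward pass. The sketch below is not GreedySnake's implementation: the function names are hypothetical, and it assumes that only one layer's parameters fit on the GPU at a time. It ignores activations, the backward pass, and the optimizer step, and only counts how often layer parameters would have to be fetched under each loop order.

```python
def horizontal_schedule(num_layers, num_micro_batches):
    """Each micro-batch runs through every layer before the next micro-batch starts."""
    return [(mb, layer)
            for mb in range(num_micro_batches)
            for layer in range(num_layers)]


def vertical_schedule(num_layers, num_micro_batches):
    """Every micro-batch runs through a layer before the schedule moves to the next layer."""
    return [(mb, layer)
            for layer in range(num_layers)
            for mb in range(num_micro_batches)]


def count_layer_loads(schedule):
    """Count parameter fetches, assuming a single layer is resident at a time (toy assumption)."""
    loads = 0
    resident = None
    for _, layer in schedule:
        if layer != resident:  # a different layer is needed, so its parameters are fetched
            loads += 1
            resident = layer
    return loads


if __name__ == "__main__":
    layers, micro_batches = 4, 3
    print("horizontal loads:", count_layer_loads(horizontal_schedule(layers, micro_batches)))
    print("vertical loads:", count_layer_loads(vertical_schedule(layers, micro_batches)))
```

Under these assumptions, the horizontal order fetches each layer once per micro-batch, while the vertical order fetches each layer once per pass regardless of the number of micro-batches. Both orders execute the same set of (micro-batch, layer) steps. A real system must also account for the activation traffic that vertical scheduling introduces, which this sketch leaves out.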

Paper Structure

This paper contains 23 sections, 13 figures, 2 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Gradient Accumulation: From Horizontal to Vertical. The swapping traffic is only plotted for a subset of layers.
  • Figure 2: A conceptual diagram of the heterogeneous memory LLM training. Overview of (a) the forward pass, (b) the backward pass, and (c) the optimizer step.
  • Figure 3: Roofline model of SSD-offloaded Training
  • Figure 4: Batch size scaling in a single forward-backward schedule. We use GPT-65B (see the evaluation setup section) as an example.
  • Figure 5: Impact of horizontal vs. vertical scheduling on GPU load and offload traffic. We use GPT-65B as an example.
  • ...and 8 more figures