PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training

Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, Gregory R. Ganger, Yida Wang

TL;DR

PipeFill tackles the inefficiency of pipeline-parallel LLM training by filling pipeline bubbles with independent fill jobs, reclaiming idle GPU time with minimal impact on the main training job. It introduces a pipeline bubble instruction, a fill-job execution plan algorithm, and a fill-job scheduler, and integrates them into a DeepSpeed-based system that supports the GPipe and 1F1B schedules. Experiments show GPU utilization gains of up to 63% at 8K GPUs with less than 2% slowdown of the main job, which reduces the utilization cost of scaling out large-model training. The approach offers a practical, schedule-aware way to improve data-center GPU efficiency for very large DNN training.

Abstract

Training Deep Neural Networks (DNNs) with billions of parameters generally involves pipeline-parallel (PP) execution. Unfortunately, PP model training can use GPUs inefficiently, especially at large scale, due to idle GPU time caused by pipeline bubbles, which are often 15-30% and can exceed 60% of the training job's GPU allocation. To improve the GPU utilization of PP model training, this paper describes PipeFill, which fills pipeline bubbles with execution of other pending jobs. By leveraging bubble GPU time, PipeFill reduces the GPU utilization sacrifice associated with scaling up large-model training. To context-switch between fill jobs and the main training job with minimal overhead to the main job, and to maximize fill job efficiency, PipeFill carefully fits fill job work to measured bubble durations and GPU memory availability, introduces explicit pipeline-bubble instructions, and orchestrates placement and execution of fill jobs in pipeline bubbles. Experiments show that PipeFill can increase overall utilization by up to 63% for GPUs used in large-scale LLM training, with <2% slowdown of the training job, and by 5-15% even for low-scale LLM training. For large-scale LLM training on 8K GPUs, the 63% increase translates to up to 2.6K additional GPUs' worth of work completed.
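To make the scaling effect concrete, the sketch below estimates the bubble fraction of a synchronous pipeline schedule using the commonly cited idealized formula (p - 1) / (m + p - 1) for p stages and m microbatches. This formula and the function name `bubble_fraction` are assumptions for illustration and are not taken from this page; the formula also assumes every stage takes equal time per microbatch and ignores communication. The 4-stage pipeline in the example is hypothetical.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idealized idle fraction of a synchronous pipeline (GPipe or non-interleaved 1F1B).

    Assumes equal per-stage compute time and no communication cost, so this is
    a rough estimate rather than a measured bubble duration.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    # Each stage waits (num_stages - 1) slots out of (num_microbatches + num_stages - 1).
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    stages = 4  # hypothetical pipeline depth
    # Fixed minibatch of 4 microbatches on one pipeline replica.
    one_replica = bubble_fraction(stages, 4)
    # Same minibatch split across two replicas: 2 microbatches per pipeline.
    two_replicas = bubble_fraction(stages, 2)
    print(f"1 replica:  {one_replica:.3f}")
    print(f"2 replicas: {two_replicas:.3f}")
```

Under these assumptions, halving the microbatches per pipeline raises the estimated bubble fraction from 3/7 to 3/5, which matches the qualitative trend the abstract describes: adding GPUs at a fixed minibatch size shortens each minibatch but leaves a larger share of GPU time idle.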

Paper Structure

This paper contains 23 sections, 2 equations, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: Utilization of LLM training GPUs. The lines correspond to scaling out training of a 40B-parameter LLM from 1K GPUs to 8K GPUs to reduce training time from 82 days (1K) to 34 days (4K) to 26 days (8K). Traditionally, the growing pipeline bubbles when scaling out lead to over 60% lower GPU utilization at 8K. PipeFill is able to fill much of that bubble GPU time with useful work, without slowing the LLM training. Section \ref{sec:setup} details the experimental setup.
  • Figure 2: Pipeline parallelism combined with data parallelism. Replicating the pipeline (double the number of GPUs) with the overall minibatch size fixed (at 4 microbatches) leads to shorter per-minibatch execution time but also a larger fraction of GPU time lost to pipeline bubbles.
  • Figure 3: System overview
  • Figure 4: Simulator results of running a 40B LLM training job using 1K to 8K GPUs.
  • Figure 5: GPU TFLOPS of running a 5B LLM on the physical cluster with varying filled bubble durations.
  • ...and 5 more figures