Table of Contents
Fetching ...

Data movement limits to frontier model training

Ege Erdil, David Schneider-Joseph

TL;DR

A theoretical model of distributed training is presented, and it is used to analyze how far dense and sparse training runs can be scaled, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth.

Abstract

We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about $10^{28}$ FLOP, two orders of magnitude above the largest training run to date, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth. A training run exceeding about $10^{31}$ FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.

Data movement limits to frontier model training

TL;DR

A theoretical model of distributed training is presented, and it is used to analyze how far dense and sparse training runs can be scaled, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth.

Abstract

We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about FLOP, two orders of magnitude above the largest training run to date, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth. A training run exceeding about FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.

Paper Structure

This paper contains 40 sections, 59 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: With current technology, such as the H100 GPU and current scaling techniques, data movement bottlenecks lower hardware utilization for training runs exceeding $10^{28}$ FLOP, and a "latency wall" renders surpassing $10^{31}$ FLOP infeasible (left). However, with innovations in scaling (such as techniques to enable much larger batch sizes) or dramatic increases in network bandwidth coupled with a 10$\times$ reduction in inter- and intra-GPU latency, training runs can be at least a few orders of magnitude larger (right).
  • Figure 2: Data parallelism. The input data is divided into shards and processed independently by multiple model replicas. Each replica computes gradients for its local shard, which are then all-reduced across all replicas to obtain the full batch gradient. This full gradient finally updates the model parameters on each replica.
  • Figure 3: 2D tensor parallelism for an MLP block expert's first layer weight matrix $W$. Start: Input vector $X_{\text{in}}$ is scattered "horizontally" (across $N_m = N_{\text{TP, model}}$ GPUs) and duplicated "vertically" (across $N_f = N_{\text{TP, ff}}$ GPUs), and weight matrix shards $W_{11}$ through $W_{N_fN_m}$ are scattered along $d_{\text{ff}}$ (rows) and $d_{\text{model}}$ (columns) in the same set of GPUs. Step 1: Each GPU computes a partially-reduced shard of output matrix $X_\text{out}$. Step 2: All-reduce operation across GPU rows computes fully-reduced shards of $X_\text{out}$. End: Each GPU on a given row has an identical, fully-reduced copy of the $X_\text{out}$ shard corresponding to that row.
  • Figure 4: Pipeline parallelism. This diagram depicts the sequential execution flow of microbatches $F_1$ to $F_4$ during the training of a deep learning model using a GPipe pipeline schedule huang2019gpipe. Device groups, each containing one or more GPUs, are allocated distinct sets of model layers, shown here as residual blocks, and they process the microbatches in a staggered fashion. Pipeline bubbles, the white areas, indicate periods of GPU inactivity due to dependency waits. The optimizer step, highlighted in gray, follows the backward passes and is where model parameter updates occur.
  • Figure 5: Default cuBLAS GEMM kernel performance on an A100 GPU. The green and blue curves show observed wall clock time and utilization, respectively. The purple curve shows a cubic polynomial fit to wall clock time, with dashed lines at the boundaries between a constant $\approx 4.5 \text{ }\mu\text{s}$ latency-bound regime, a linear (likely occupancy-bound) regime, and a cubic compute-bound regime.
  • ...and 6 more figures