Data movement limits to frontier model training

Ege Erdil; David Schneider-Joseph

TL;DR

We present a theoretical model of distributed training and use it to analyze how far dense and sparse training runs can be scaled. The analysis suggests that fundamental barriers to scaling will arrive in about three years at recent rates of growth.

Abstract

We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three-month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about $10^{28}$ FLOP. This is two orders of magnitude above the largest training run to date, which suggests that fundamental barriers to scaling will arrive in about three years at recent rates of growth. A training run exceeding about $10^{31}$ FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.
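The arithmetic behind these thresholds can be checked with a short back-of-the-envelope sketch. It is not the paper's model: the function names are hypothetical, the three-month duration is taken as 90 days, and the annual growth multiplier is derived from the abstract's own figures (two orders of magnitude in three years) rather than quoted from it.

```python
import math

SECONDS_PER_DAY = 86_400


def required_throughput(total_flop: float, days: float = 90.0) -> float:
    """Average sustained FLOP/s needed to finish `total_flop` in `days` days."""
    return total_flop / (days * SECONDS_PER_DAY)


def years_to_scale(factor: float, growth_per_year: float) -> float:
    """Years for compute to grow by `factor` at a constant annual multiplier."""
    return math.log(factor) / math.log(growth_per_year)


def implied_growth(factor: float, years: float) -> float:
    """Constant annual multiplier that yields `factor` growth over `years`."""
    return factor ** (1.0 / years)


if __name__ == "__main__":
    # Thresholds quoted in the abstract; 90 days is an assumed reading
    # of "three month training duration".
    for flop in (1e28, 1e31):
        rate = required_throughput(flop)
        print(f"{flop:.0e} FLOP in 90 days needs {rate:.2e} FLOP/s sustained")

    # Two orders of magnitude in three years implies this annual multiplier.
    # It is derived here, not stated in the abstract.
    growth = implied_growth(100.0, 3.0)
    print(f"Implied growth: {growth:.2f}x per year")
    print(f"Check: {years_to_scale(100.0, growth):.2f} years to gain 100x")
```

Under these assumptions, a $10^{28}$ FLOP run over 90 days needs roughly $1.3 \times 10^{21}$ FLOP/s of sustained useful throughput, and a three-year horizon for a 100x increase corresponds to compute growing by about 4.6x per year.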
