Time Matters: Scaling Laws for Any Budget

Itay Inbar; Luke Sernau

TL;DR

In contrast to the existing literature, this analysis predicts that models should be wider rather than deeper, because the benefits of training speed outweigh the benefits of depth.

Abstract

A primary cost driver for training large models is wall-clock training time. We show that popular time estimates based on FLOPs are poor predictors of wall-clock time, and construct a more accurate proxy based on memory copies. This lets us estimate the training speed of a transformer model from its hyperparameters. Combined with a scaling law such as Chinchilla, it lets us predict the final loss of a model from a simple equation. We show that this expression is accurate across a wide range of model hyperparameter values, enabling us to make architectural decisions analytically and train models more efficiently. Crucially, this analysis predicts that, in contrast to existing literature, models should be wider rather than deeper, as the benefits of speed outweigh the benefits of depth.
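As a rough illustration of the idea in the abstract, the sketch below turns a wall-clock budget into a token count via a throughput figure and feeds it into the Chinchilla parametric loss. It is a minimal sketch under stated assumptions: the constants are the published fit from Hoffmann et al. (2022), not the linear coefficients estimated in this paper, and the throughput values and the function name loss_for_time_budget are hypothetical placeholders rather than the paper's memory-copy-based estimate.

```python
# Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta.
# ASSUMPTION: constants are the fit reported by Hoffmann et al. (2022),
# not the coefficients derived in this paper.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted final loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def loss_for_time_budget(n_params: float, tokens_per_second: float,
                         budget_seconds: float) -> float:
    """Hypothetical helper: under a fixed time budget, the number of
    tokens seen is throughput times time, so a faster model sees more data."""
    n_tokens = tokens_per_second * budget_seconds
    return chinchilla_loss(n_params, n_tokens)


# ASSUMPTION: illustrative throughputs for two models of equal size.
budget = 24 * 3600  # one day of training, in seconds
slow = loss_for_time_budget(1e9, 5e4, budget)
fast = loss_for_time_budget(1e9, 1e5, budget)
print(f"slow model loss: {slow:.4f}")
print(f"fast model loss: {fast:.4f}")
```

With the parameter count held equal, the higher-throughput configuration reaches a lower predicted loss within the same budget, which is the trade-off the abstract describes between speed and depth.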

Paper Structure (13 sections, 4 equations, 6 figures, 4 tables)

Introduction
The parameter equivalence heuristic
Estimating linear scaling law Coefficients
Equations for estimating the speed of a model
Estimating the throughput
Putting it all together
Better loss with faster models
Other Architectures
Conclusion
Equations
FLOPS derivation
MEMCPYS derivation
PARAMS derivation

Figures (6)

Figure 1: Loss predictions over different trained models via Chinchilla, using our linear coefficients.
Figure 2: Estimating runtime with Equation \ref{time}
Figure 3: Runtime prediction ablation
Figure 4: Predicted vs actual loss. The prediction is made using only the hyperparameters.
Figure 5: Chinchilla using empirical data consumption vs estimated (ours)
...and 1 more figure
