Time Matters: Scaling Laws for Any Budget

Itay Inbar, Luke Sernau

TL;DR

This analysis predicts that, in contrast to existing literature, models should be wider rather than deeper, as the benefits of speed outweigh the benefits of depth.

Abstract

A primary cost driver for training large models is wall-clock training time. We show that popular time estimates based on FLOPs are poor predictors of training time, and construct a more accurate proxy based on memory copies. This proxy lets us estimate the training speed of a transformer model from its hyperparameters. Combined with a scaling law curve like Chinchilla, it lets us predict the final loss of a model from a simple equation. We show that this expression is accurate across a wide range of model hyperparameter values, enabling us to make architectural decisions analytically and train models more efficiently. Crucially, this analysis predicts that, in contrast to existing literature, models should be wider rather than deeper, as the benefits of speed outweigh the benefits of depth.
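The chain of reasoning in the abstract (training speed determines how many tokens fit in a time budget, and a Chinchilla-style curve maps parameters and tokens to loss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the default constants are the published fit from the original Chinchilla paper (Hoffmann et al., 2022), not the coefficients fitted in this work; the throughput is taken as a given input rather than computed from the paper's memory-copy proxy, whose equation is not reproduced here; and the function names and example numbers are hypothetical.

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric loss L(N, D) = E + A / N^alpha + B / D^beta.

    Defaults are the constants reported by Hoffmann et al. (2022).
    They are an assumption here; this paper fits its own coefficients.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta


def loss_for_time_budget(n_params, tokens_per_second, budget_seconds):
    """Predicted loss when training for a fixed wall-clock budget.

    tokens_per_second is a hypothetical input (measured or estimated);
    the paper derives training speed from hyperparameters instead.
    """
    n_tokens = tokens_per_second * budget_seconds
    return chinchilla_loss(n_params, n_tokens)


if __name__ == "__main__":
    # Two made-up configurations with the same parameter count but
    # different training speeds, compared under a one-day budget.
    budget = 24 * 3600
    slow = loss_for_time_budget(1e9, tokens_per_second=1e5, budget_seconds=budget)
    fast = loss_for_time_budget(1e9, tokens_per_second=2e5, budget_seconds=budget)
    print(f"slower config: {slow:.3f}, faster config: {fast:.3f}")
```

Under a fixed time budget, the faster configuration sees more tokens and therefore reaches a lower predicted loss at equal parameter count, which is the kind of trade-off the abstract describes between speed and depth.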

Paper Structure (13 sections, 4 equations, 6 figures, 4 tables)

This paper contains 13 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Loss predictions over different trained models via Chinchilla, using our linear coefficients.
  • Figure 2: Estimating runtime with the runtime equation.
  • Figure 3: Runtime prediction ablation
  • Figure 4: Predicted vs actual loss. The prediction is made using only the hyperparameters.
  • Figure 5: Chinchilla using empirical data consumption vs estimated (ours)
  • ...and 1 more figure