Time is Not Compute: Scaling Laws for Wall-Clock Constrained Training on Consumer GPUs

Yi Liu

Abstract

Scaling laws relate model quality to compute budget (FLOPs), but practitioners face wall-clock time constraints, not compute budgets. We study optimal model sizing under fixed time budgets from 5 minutes to 24 hours on consumer GPUs (RTX 4090). Across 70+ runs spanning 50M--1031M parameters, we find: (1)~at each time budget a U-shaped curve emerges where too-small models overfit and too-large models undertrain; (2)~optimal model size follows $N^* \propto t^{0.60}$, growing \emph{faster} than Chinchilla's $N^* \propto C^{0.50}$, with $\alpha = 0.60 \pm 0.07$ robustly exceeding compute-optimal across all sensitivity analyses; (3)~a \emph{dual U-shape mechanism}: short-budget U-curves arise from compute bottlenecks, while long-budget U-curves emerge from data bottlenecks (overfitting), with an intermediate regime where the U-curve temporarily disappears. These findings have immediate implications for researchers training on consumer hardware, where wall-clock time -- not FLOPs -- is the binding constraint. We release all code, logs, and 70+ experimental configurations.

Figures (6)

  • Figure 1: Main result. (a) U-shaped optimality curves at 8 time budgets (5min--24h). Stars mark the optimal model at each budget. (b) Optimal model size vs. time, fitted with $N^* = 14.2 \times t^{0.595}$ ($R^2 = 0.963$). The blue dotted line shows Chinchilla's $\alpha = 0.50$ for reference. (c) Best achievable BPB vs. time, fitted with $L^* = 1.22 \times t^{-0.061}$ ($R^2 = 0.971$), showing severe diminishing returns.
  • Figure 2: Key phenomena. Left: the dual U-shape mechanism across three regimes. Right: training trajectories showing model-specific overfitting dynamics. Additional figures ($\alpha$ convergence, heatmap, dual U-shape detail) in Appendix.
  • Figure 3: BPB heatmap across model sizes and time budgets. The diagonal band of optimal values traces the time-constrained scaling law. Gold boxes mark optimal configurations.
  • Figure 4: $\alpha$ evolution as time points are added: 0.44 (5pt) $\to$ 0.55 (6pt) $\to$ 0.75 (7pt) $\to$ 0.60 (8pt). The non-monotonic convergence reflects regime transitions (Section \ref{sec:discussion}).
  • Figure 5: Dual U-shape mechanism. (a) Change in BPB from 12h to 24h per model. Models D14--D20 overfit (red bars, positive $\Delta$), while D24--D26 continue improving (green bars, negative $\Delta$). (b) Conceptual illustration of the dual U-shape.
  • ...and 1 more figure
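The fitted power laws from Figure 1(b--c) can be evaluated directly to predict the optimal model size and best achievable loss for a given time budget. A minimal sketch, assuming $t$ is measured in minutes and $N^*$ in millions of parameters (units inferred from the 5min--24h budgets and the 50M--1031M model range; the function names are illustrative, not from the paper's released code):

```python
# Power-law fits reported in Figure 1 of the paper.
# Assumed units: t in minutes, N* in millions of parameters --
# consistent with the reported 5 min--24 h budgets and the
# 50M--1031M model range studied.
A_N, ALPHA = 14.2, 0.595    # N* = 14.2 * t^0.595  (R^2 = 0.963)
A_L, BETA = 1.22, -0.061    # L* = 1.22 * t^-0.061 (R^2 = 0.971)

def optimal_model_size(t_minutes: float) -> float:
    """Predicted optimal parameter count (millions) for a wall-clock budget."""
    return A_N * t_minutes ** ALPHA

def best_bpb(t_minutes: float) -> float:
    """Predicted best achievable bits-per-byte for a wall-clock budget."""
    return A_L * t_minutes ** BETA

if __name__ == "__main__":
    for t in (5, 60, 1440):  # 5 min, 1 h, 24 h
        print(f"t = {t:>4} min: N* ~ {optimal_model_size(t):6.0f}M params, "
              f"best BPB ~ {best_bpb(t):.3f}")
```

Under these assumed units the fit spans the experimental range sensibly: roughly 37M parameters at the 5-minute budget and about 1075M at 24 hours, matching the 50M--1031M sweep. The shallow exponent on $L^*$ makes the diminishing returns concrete: a 288-fold increase in time (5 min to 24 h) improves BPB by only about 30%.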