Navigating Scaling Laws: Compute Optimality in Adaptive Model Training

Sotiris Anagnostidis; Gregor Bachmann; Imanol Schlag; Thomas Hofmann

TL;DR

The paper addresses the compute inefficiency of training a fixed-shape model along a single scaling law. It introduces adaptive training, in which the model's shape changes during training so that it traverses several scaling laws. The authors develop a principled scheduling framework for patch size and context length, grounded in the inverse of each scaling law and in transitions chosen by the gradient of that inverse, and report substantial FLOPs reductions for both Vision Transformers and language models. Key contributions include FlexiViT-based patch-size adaptation, a scheduling strategy that outperforms fixed architectures and random baselines, and extensions to width expansion and shifts in the training objective. Together these suggest the approach applies broadly to reducing training compute. By enabling compute-efficient optimization across multiple dimensions of model shape, the findings have practical implications for lowering environmental impact and democratizing access to frontier-model training.
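To make the link between patch size and compute concrete, the following is a minimal sketch, not the paper's FLOPs accounting. It uses a common back-of-the-envelope estimate for a Transformer encoder and ignores the patch embedding, layer norms, any class token, and the head. The function names, the 224-pixel input, and the reading of V640-12 as width 640 and depth 12 are assumptions made for illustration.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    return (image_size // patch_size) ** 2


def approx_forward_flops(n_tokens: int, width: int, depth: int, mlp_ratio: int = 4) -> int:
    """Rough forward-pass FLOPs of a Transformer encoder (assumed estimate).

    Counts multiply-accumulates (MACs) per layer and doubles them:
      - Q, K, V and output projections: 4 * N * d^2
      - MLP with hidden size mlp_ratio * d: 2 * mlp_ratio * N * d^2
      - attention scores and weighted sum of values: 2 * N^2 * d
    """
    proj_macs = 4 * n_tokens * width ** 2
    mlp_macs = 2 * mlp_ratio * n_tokens * width ** 2
    attn_macs = 2 * n_tokens ** 2 * width
    return 2 * depth * (proj_macs + mlp_macs + attn_macs)


# Hypothetical setting: 224x224 input, width 640, depth 12.
for patch in (32, 16, 8):
    n = num_patches(224, patch)
    print(patch, n, approx_forward_flops(n, width=640, depth=12))
```

Under these assumptions, halving the patch size quadruples the token count, and the estimated forward-pass cost grows slightly faster than that because the attention term is quadratic in the number of tokens.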

Abstract

In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of compute. This leads to the notion of a 'compute-optimal' model, i.e. a model that allocates a given level of compute during training optimally to maximize performance. In this work, we extend the concept of optimality by allowing for an 'adaptive' model, i.e. a model that can change its shape during training. By doing so, we can design adaptive models that optimally traverse between the underlying scaling laws and outpace their 'static' counterparts, leading to a significant reduction in the required compute to reach a given target performance. We show that our approach generalizes across modalities and different shape parameters.

Paper Structure (36 sections, 7 equations, 22 figures, 3 tables)

Introduction
Related Work
ViTs and Optimal Patch Sizes
Fixed patch size training.
Adaptive Patch Sizes and Traversing Scaling Laws
Adaptive patch size.
Traversing scaling laws.
Scheduled training.
Is the schedule optimal?
Smaller patch sizes.
Adaptive Context Size of an LLM
Other Shape Parameters
Adapting Model Width
Scaling width.
Scheduling width.
...and 21 more sections

Figures (22)

Figure 1: Patch sizes define (left) how images are processed, while (right) impacting the compute of a forward pass.
Figure 2: (Left) Hyperparameters are optimized across model classes. (Right) The ViT models used for this study.
Figure 3: (Left) Different scaling law curves (function $f$ in Eq. \ref{eq:single_scaling_law}) corresponding to different training configurations. Black arrows indicate points of transition between scaling laws. (Middle) We illustrate the inverse $f^{-1}$ of the above function for the same scaling law curves. (Right) We visualize the gradient of the inverse, $\partial f^{-1}(E) / \partial E$, for the same scaling laws. Taking the curve that maximizes this gradient leads to a partition of the space. From this partition, we can deduce a strategy determining which scaling law to 'follow' for each performance level.
Figure 4: Downstream performance as a function of compute for the V$640$-$12$ model and different patch sizes. We use a log-log scale.
Figure 5: Downstream performance of the V$640$-$12$ trained with our patch size scheduler, and its potential benefits.
...and 17 more figures
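
The transition rule described in the Figure 3 caption can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the saturating power-law form $E = a(C + d)^{-b} + c$ is assumed here because the paper's Eq. for $f$ is not reproduced on this page, and the class name, function names, configuration labels, and coefficients are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ScalingLaw:
    """Assumed form E = f(C) = a * (C + d) ** (-b) + c, with a, b > 0."""
    a: float
    b: float
    c: float  # irreducible error of this configuration
    d: float

    def error(self, compute: float) -> float:
        """f(C): predicted error after spending `compute`."""
        return self.a * (compute + self.d) ** (-self.b) + self.c

    def inverse(self, error: float) -> float:
        """f^{-1}(E): compute needed to reach `error` (requires error > c)."""
        return ((error - self.c) / self.a) ** (-1.0 / self.b) - self.d

    def inverse_grad(self, error: float) -> float:
        """d f^{-1} / dE: negative, since lower error costs more compute."""
        base = (error - self.c) / self.a
        return -(1.0 / (self.a * self.b)) * base ** (-1.0 / self.b - 1.0)


def best_config(laws: Mapping[str, ScalingLaw], error: float) -> str:
    """Pick the configuration whose inverse has the largest gradient at `error`.

    The gradient is negative, so the maximum is the value closest to zero:
    the configuration that needs the least extra compute per unit of error
    reduction at this performance level.
    """
    reachable = {name: law for name, law in laws.items() if error > law.c}
    if not reachable:
        raise ValueError("no configuration can reach this error level")
    return max(reachable, key=lambda name: reachable[name].inverse_grad(error))


# Hypothetical coefficients: a coarse-patch law that saturates at a higher
# error and a fine-patch law with a lower floor.
example_laws = {
    "patch_32": ScalingLaw(a=1.0, b=0.5, c=0.30, d=1.0),
    "patch_16": ScalingLaw(a=2.0, b=0.5, c=0.10, d=1.0),
}

for level in (0.9, 0.4, 0.2):
    print(level, best_config(example_laws, level))
```

Evaluating `best_config` over a grid of error levels yields the partition the caption refers to: each interval of performance levels is assigned the scaling law to follow, and the boundaries between intervals are the transition points.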
