Navigating Scaling Laws: Compute Optimality in Adaptive Model Training

Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann

TL;DR

The paper tackles the compute-inefficiency of scaling laws by introducing adaptive training that varies model shape during training to traverse different scaling laws. It develops a principled patch-size and context-length scheduling framework, grounded in inverse scaling laws and gradient-based transitions, and demonstrates substantial FLOPs reductions across Vision Transformers and language models. Key contributions include FlexiViT-based patch-size adaptation, a scheduling strategy that outperforms fixed architectures and random baselines, and extensions to width expansion and training-objective shifts, all pointing to broad applicability for reducing training compute. The findings have practical implications for lowering environmental impact and democratizing access to frontier-model training by enabling compute-efficient optimization across multiple dimensions of model shape.

Abstract

In recent years, the state of the art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of compute. This leads to the notion of a 'compute-optimal' model, i.e. a model that allocates a given level of compute during training optimally to maximize performance. In this work, we extend the concept of optimality by allowing for an 'adaptive' model, i.e. a model that can change its shape during training. By doing so, we can design adaptive models that optimally traverse between the underlying scaling laws and outpace their 'static' counterparts, leading to a significant reduction in the required compute to reach a given target performance. We show that our approach generalizes across modalities and different shape parameters.

Paper Structure (36 sections, 7 equations, 22 figures, 3 tables)

Figures (22)

  • Figure 1: Patch sizes define (left) how images are processed, while (right) impacting the compute of a forward pass.
  • Figure 2: (Left) Hyperparameters are optimized across model classes. (Right) The ViT models used for this study.
  • Figure 3: (Left) Different scaling law curves (function $f$ in Eq. \ref{eq:single_scaling_law}) corresponding to different training configurations. Black arrows indicate points of transition between scaling laws. (Middle) We illustrate the inverse of the above function $f^{-1}$ for the same scaling law curves. (Right) We visualize the gradient of the inverse $\partial f^{-1}(E) / \partial E$ for the same scaling laws. Taking the curve that maximizes this gradient at each error level leads to a partition of the space. From this partition, we can deduce a strategy determining which scaling law to 'follow' for each performance level.
  • Figure 4: Downstream performance as a function of compute for the V$640$-$12$ model and different patch sizes. We use a log-log scale.
  • Figure 5: Downstream performance of the V$640$-$12$ model trained with our patch size scheduler, and its potential benefits.
  • ...and 17 more figures
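
The selection rule described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes each configuration follows a saturating power law $E = a (C + d)^{-b} + c$ (the functional form is an assumption here, since the equation itself is not reproduced in this section), and the coefficients in `LAWS` as well as the names `inverse`, `inverse_gradient`, `best_law` and `schedule` are hypothetical. Because more compute lowers the error, the gradient of the inverse is negative, so the maximizing curve is the one that needs the least extra compute per unit of error reduction.

```python
import math

# Hypothetical coefficients (a, b, c, d) for E = a * (C + d) ** (-b) + c.
# These are toy values chosen for illustration, not fitted to any data.
LAWS = {
    "patch_32": (1.0, 0.5, 0.30, 1.0),
    "patch_16": (2.0, 0.5, 0.20, 1.0),
    "patch_8": (4.0, 0.5, 0.10, 1.0),
}


def inverse(law, error):
    """Compute C = f^{-1}(E): the compute needed to reach a given error."""
    a, b, c, d = law
    return ((error - c) / a) ** (-1.0 / b) - d


def inverse_gradient(law, error):
    """Derivative of f^{-1} with respect to E (negative for reachable errors)."""
    a, b, c, d = law
    if error <= c:
        # This law saturates at c and can never reach the requested error.
        return -math.inf
    return -(1.0 / (a * b)) * ((error - c) / a) ** (-1.0 / b - 1.0)


def best_law(laws, error):
    """Pick the law whose inverse has the largest gradient at this error."""
    return max(laws, key=lambda name: inverse_gradient(laws[name], error))


def schedule(laws, errors):
    """Map a decreasing sequence of error levels to the law to follow."""
    return [(error, best_law(laws, error)) for error in errors]


if __name__ == "__main__":
    for error, name in schedule(LAWS, [0.9, 0.5, 0.4, 0.25]):
        print(f"error {error:.2f}: follow {name}")
```

With these toy coefficients the rule starts on the cheapest configuration and moves to smaller patches as the error drops, which mirrors the partition the caption describes. In practice the coefficients would come from fitting measured training curves for each configuration.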