Navigating Scaling Laws: Compute Optimality in Adaptive Model Training
Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann
TL;DR
The paper addresses the compute cost of training models whose shape stays fixed, as standard scaling laws assume, by introducing adaptive training that varies model shape during training to traverse between different scaling laws. It develops a principled framework for scheduling patch size and context length, grounded in inverted scaling laws and gradient-based transitions, and demonstrates substantial FLOPs reductions for Vision Transformers and language models. Key contributions include FlexiViT-based patch-size adaptation, a scheduling strategy that outperforms both fixed architectures and random baselines, and extensions to width expansion and shifts in the training objective, all of which point to broad applicability for reducing training compute. The findings have practical implications for lowering environmental impact and for democratizing access to frontier-model training, since they enable compute-efficient optimization across multiple dimensions of model shape.
Abstract
In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of compute. This leads to the notion of a 'compute-optimal' model, i.e. a model that allocates a given level of compute during training optimally to maximize performance. In this work, we extend the concept of optimality by allowing for an 'adaptive' model, i.e. a model that can change its shape during training. By doing so, we can design adaptive models that optimally traverse between the underlying scaling laws and outpace their 'static' counterparts, leading to a significant reduction in the required compute to reach a given target performance. We show that our approach generalizes across modalities and different shape parameters.
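The idea of traversing between scaling laws can be illustrated with a toy sketch. It is not the authors' implementation: it assumes each model shape follows a power law E(C) = a * C^(-b) + c with made-up coefficients, that switching shape preserves the current error, and that the shape to train next is the one whose law has the steepest slope dE/dC at the current error level. The names `PowerLaw`, `LAWS`, `best_shape`, `adaptive_compute`, and `static_compute` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerLaw:
    """Toy scaling law E(C) = a * C**(-b) + c (coefficients are hypothetical)."""
    a: float
    b: float
    c: float

    def compute_for(self, error: float) -> float:
        """Invert the law: compute at which this shape attains `error`."""
        if error <= self.c:
            return float("inf")  # the irreducible error c is never crossed
        return ((error - self.c) / self.a) ** (-1.0 / self.b)

    def slope_at_error(self, error: float) -> float:
        """dE/dC at the point where this law attains `error` (zero or negative)."""
        if error <= self.c:
            return 0.0
        compute = self.compute_for(error)
        return -self.a * self.b * compute ** (-self.b - 1.0)


# Assumed laws for two shapes: "coarse" is cheap early but plateaus higher,
# "fine" is costlier early but reaches a lower error floor.
LAWS = {
    "coarse": PowerLaw(a=1.0, b=0.5, c=0.30),
    "fine": PowerLaw(a=3.0, b=0.5, c=0.10),
}


def best_shape(laws: dict, error: float) -> str:
    """Pick the shape whose law decreases fastest per unit compute at `error`."""
    return min(laws, key=lambda name: laws[name].slope_at_error(error))


def static_compute(law: PowerLaw, start_error: float, target_error: float) -> float:
    """Compute a single fixed shape needs to go from start_error to target_error."""
    return law.compute_for(target_error) - law.compute_for(start_error)


def adaptive_compute(laws: dict, start_error: float, target_error: float,
                     steps: int = 2000) -> float:
    """Midpoint-rule integral of dE / |dE/dC|, always following the steepest law."""
    width = (start_error - target_error) / steps
    total = 0.0
    for i in range(steps):
        mid = target_error + (i + 0.5) * width
        steepest = min(law.slope_at_error(mid) for law in laws.values())
        if steepest == 0.0:
            raise ValueError("target error is below every law's floor")
        total += width / -steepest
    return total


print("shape at error 0.8:", best_shape(LAWS, 0.8))
print("shape at error 0.4:", best_shape(LAWS, 0.4))
print("adaptive compute:", adaptive_compute(LAWS, 0.9, 0.35))
for name, law in LAWS.items():
    print(f"static {name} compute:", static_compute(law, 0.9, 0.35))
```

Under these assumptions the adaptive schedule starts with the coarse shape and switches to the fine one once its law becomes steeper, so it needs less total compute than either static shape to reach the same target error. Real schedules would rely on fitted laws and would have to account for any cost of changing shape.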
