Gemstones: A Model Suite for Multi-Faceted Scaling Laws

Sean McLeish, John Kirchenbauer, David Yu Miller, Siddharth Singh, Abhinav Bhatele, Micah Goldblum, Ashwinee Panda, Tom Goldstein

TL;DR

The paper investigates scaling laws for transformer models and shows that their prescriptions are highly sensitive to architectural shape and training choices. It introduces Gemstones, an open-source suite of over 4000 checkpoints from models with 50M to 2B parameters, varying width, depth, training tokens, and hyperparameters. A convex-hull fitting method is proposed to isolate the frontier of optimal models; it reveals how fragile prior laws are and how strongly data sampling and model selection affect the fit. The study also examines downstream benchmarks and the trade-off between FLOPs and compute time: under the same FLOPs budget, deeper models can score better on benchmarks, while wider models are often more time-efficient, which has practical implications for model design and deployment. Overall, the work provides a reproducible framework and dataset for studying scaling laws with open transformer models and clarifies how design choices shape scaling prescriptions.
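
As a rough illustration of the convex-hull idea, the sketch below keeps only the checkpoints that lie on the lower convex hull of (log10 FLOPs, validation loss) points. It is not the authors' implementation: the choice of axes, the trimming of hull points past the minimum loss, the helper names `lower_hull` and `frontier`, and the checkpoint values are all assumptions made for illustration.

```python
import math


def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower convex hull of 2-D points (x, y, ...), left to right (monotone chain)."""
    hull = []
    for p in sorted(points):
        # Drop the previous point while it lies on or above the segment hull[-2] -> p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def frontier(checkpoints):
    """Names of checkpoints on the lower hull, trimmed at the minimum loss.

    Assumption: a hull point with more compute but higher loss than the best
    point is not treated as compute-optimal, so it is discarded.
    """
    pts = [(math.log10(flops), loss, name) for name, flops, loss in checkpoints]
    hull = lower_hull(pts)
    best = min(range(len(hull)), key=lambda i: hull[i][1])
    return [name for _, _, name in hull[: best + 1]]


# Hypothetical (name, FLOPs, validation loss) triples, not taken from the paper.
checkpoints = [
    ("a", 1e18, 3.60),
    ("b", 2e18, 3.50),
    ("c", 4e18, 3.20),
    ("d", 1e19, 3.10),
    ("e", 3e19, 2.90),
    ("f", 1e20, 2.95),
]

selected = frontier(checkpoints)
print(selected)
```

In this toy data, checkpoints b and d sit above the hull and f spends more compute for a higher loss, so only a, c, and e are kept for fitting.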

Abstract

Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes, including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.

Paper Structure

This paper contains 51 sections, 7 equations, 24 figures, 2 tables.

Figures (24)

  • Figure 1: The meaning of width and depth. We visualize a standard transformer architecture, highlighting the "width" as the size of the hidden dimension and the "depth" as the number of transformer blocks.
  • Figure 2: Distribution of prior scaling law models, industry models, and our models in terms of width and depth. Prior work (purple and green) and industry models (blue and orange) mostly lie on a fixed width-depth line.
  • Figure 3: Approach 1 prescriptions. Row one: Validation loss over FLOPs (left) and GPU hours (right) for the first $100$ billion tokens of training. We use Approach 1 to find the optimal points on the convex hull in each setting, marked with black crosses. Row two: We fit a line to the tokens per parameter of empirically optimal models and find a slightly higher, but still constant, tokens per parameter prescription than Hoffmann et al. (2022). Their Approach 1 creates $250$ logarithmically spaced FLOPs bins per order of magnitude; in red we plot the minimizers over these bins and the scaling law fitted to these minimizers (binning). Their Approach 1 is not well suited to our data, and our convex hull approach is better when we select fewer models to fit our law on. Extended plot in \ref{fig:approach-1-full}.
  • Figure 4: Loss over multiple webtext datasets. We see that the loss value changes for different datasets, including Dolma which we train on. DCLM and FineWeb have higher loss values whereas we measure lower loss values on FineWeb-Edu and Dolma. However, the rank order between models is stable across datasets. This suggests that it may be valid to fit scaling laws on various validation sets without necessarily needing to retrain the underlying models regardless of whether the validation data is i.i.d. with respect to the training distribution.
  • Figure 5: Benchmark Scaling Law for Error. We fit a law of the form $\text{Err}(L) = \epsilon - k \cdot \exp(-\gamma L)$ (\ref{eq:benchmark-fitting}) to benchmark results sampled at every $10$ billion tokens and observe a tight fit.
  • ...and 19 more figures
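
To make the functional form in Figure 5 concrete, the sketch below fits $\text{Err}(L) = \epsilon - k \cdot \exp(-\gamma L)$ to synthetic (loss, error) pairs. It is only an illustration under stated assumptions: $\epsilon$ is held fixed at a hypothetical value so that the fit reduces to a linear regression on $\ln(\epsilon - \text{Err}) = \ln k - \gamma L$, the data are generated from known parameters rather than taken from the paper, and the function names are invented here. The paper's own fitting procedure may differ.

```python
import math


def benchmark_error(loss, eps, k, gamma):
    """Err(L) = eps - k * exp(-gamma * L), the form given in the Figure 5 caption."""
    return eps - k * math.exp(-gamma * loss)


def fit_k_gamma(losses, errors, eps):
    """Least-squares fit of ln(eps - err) = ln(k) - gamma * loss with eps fixed.

    Assumption: every observed error is strictly below eps, so the log is defined.
    """
    ys = [math.log(eps - err) for err in errors]
    n = len(losses)
    mean_x = sum(losses) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in losses)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(losses, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (k, gamma)


# Hypothetical ground-truth parameters used only to generate synthetic data.
EPS, K_TRUE, GAMMA_TRUE = 0.75, 5.0, 1.0
losses = [2.5, 2.75, 3.0, 3.25, 3.5, 4.0]
errors = [benchmark_error(L, EPS, K_TRUE, GAMMA_TRUE) for L in losses]

k_hat, gamma_hat = fit_k_gamma(losses, errors, EPS)
print(f"k = {k_hat:.4f}, gamma = {gamma_hat:.4f}")
```

Under this form the error approaches $\epsilon$ as the loss grows and falls as the loss shrinks, so $\epsilon$ acts as an error ceiling. Fitting $\epsilon$ jointly with $k$ and $\gamma$ would need a nonlinear optimizer instead of the linearization used here.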