Gemstones: A Model Suite for Multi-Faceted Scaling Laws

Sean McLeish; John Kirchenbauer; David Yu Miller; Siddharth Singh; Abhinav Bhatele; Micah Goldblum; Ashwinee Panda; Tom Goldstein

TL;DR

The paper investigates scaling laws for transformer models and demonstrates that their prescriptions are highly sensitive to architectural shape and training choices. It introduces Gemstones, a large open-source suite of over 4000 checkpoints from models with 50M to 2B parameters, spanning varied widths, depths, token counts, and hyperparameters, which enables more robust scaling analyses. A convex-hull fitting method is proposed to isolate the frontier of optimal models, revealing the fragility of prior laws and the impact of data sampling and model selection on the fit. The study also examines downstream benchmarks and the trade-off between compute time and FLOPs, showing that depth can improve benchmark performance at equal FLOPs while width often yields better time efficiency, with practical implications for model design and deployment. Overall, the work provides a reproducible framework and dataset for studying scaling laws with open transformer models and clarifies how design choices shape scaling prescriptions.
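The convex-hull idea mentioned above can be illustrated with a minimal sketch. It assumes the frontier is the lower convex hull of (log-compute, loss) points taken from training checkpoints; the function name `lower_convex_hull`, the point layout, and the toy numbers are hypothetical and are not taken from the paper, whose exact procedure may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each point is (log10 of training FLOPs, validation loss).
Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_convex_hull(points: List[Point]) -> List[Point]:
    """Return the lower convex hull of the points, ordered by increasing x.

    Uses the lower half of Andrew's monotone chain algorithm. Only the
    lowest loss at each compute value is kept before building the hull.
    """
    best: Dict[float, float] = {}
    for x, y in points:
        if x not in best or y < best[x]:
            best[x] = y
    hull: List[Point] = []
    for p in sorted(best.items()):
        # Drop the last hull point while it lies on or above the segment
        # joining its predecessor to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


if __name__ == "__main__":
    # Toy checkpoints (made-up numbers, not results from the paper).
    checkpoints = [(0.0, 4.0), (1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (4.0, 0.9)]
    for log_flops, loss in lower_convex_hull(checkpoints):
        print(log_flops, loss)
```

In this sketch a scaling law would then be fit only to the returned frontier points, so a checkpoint that is dominated at its compute budget, such as `(2.0, 3.0)` in the toy data, does not influence the fit.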

Abstract

Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on the resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes, including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
