Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig

TL;DR

Not-Just-Scaling Laws investigates why the downstream capabilities of language models depend on more than scale. The authors assemble a database of 92 publicly available decoder-only transformers and extract architecture, data, and free-generation features to train predictive regressors for 12 benchmarks. These regressors explicitly model the interaction between design decisions and performance, and can be viewed as a generalization of the scaling law $L(N,D)$, where $N$ is the number of parameters, $D$ the number of pretraining tokens, and $L$ the loss. Including design features reduces predictive error relative to scale-only models, with relative MAE improvements of about 3–28% and notable gains on code generation and natural-language reasoning tasks, while data-domain and generation-pattern features reveal systematic effects of data composition on task performance. The work also identifies practical guidelines, such as keeping the proportion of code in pretraining data around 15–25% to balance language and code tasks, and recognizing that certain architectural choices (e.g., rotary rather than learned positional embeddings) influence downstream performance. Altogether, the study offers a public resource and a framework for predicting and guiding downstream performance through design choices beyond scaling alone.

Abstract

Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.
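The comparison described above, a scale-only predictor against one that also sees design features, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the feature names (`log_params`, `log_tokens`, `code_frac`) and coefficients are hypothetical, and an ordinary least-squares model stands in for the regressors, whose actual family is not stated in this summary. The numbers it produces are not results from the paper.

```python
import numpy as np


def fit_and_mae(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares with an intercept and return the test MAE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean(np.abs(A_test @ coef - y_test)))


def make_synthetic_models(n_models=92, seed=0):
    """Generate SYNTHETIC 'models'; all ranges and coefficients are made up."""
    rng = np.random.default_rng(seed)
    log_params = rng.uniform(7.0, 11.0, n_models)   # log10 of parameter count
    log_tokens = rng.uniform(9.0, 13.0, n_models)   # log10 of pretraining tokens
    code_frac = rng.uniform(0.0, 0.5, n_models)     # fraction of code in the data
    noise = rng.normal(0.0, 0.01, n_models)
    # Hypothetical benchmark score that depends on scale AND on a design feature.
    score = 0.05 * log_params + 0.03 * log_tokens + 0.4 * code_frac + noise
    scale_only = np.column_stack([log_params, log_tokens])
    all_features = np.column_stack([log_params, log_tokens, code_frac])
    return scale_only, all_features, score


def compare(n_train=70, seed=0):
    """Return (scale-only MAE, all-features MAE, relative MAE reduction)."""
    scale_only, all_features, score = make_synthetic_models(seed=seed)
    train, test = slice(0, n_train), slice(n_train, None)
    mae_scale = fit_and_mae(scale_only[train], score[train],
                            scale_only[test], score[test])
    mae_full = fit_and_mae(all_features[train], score[train],
                           all_features[test], score[test])
    return mae_scale, mae_full, (mae_scale - mae_full) / mae_scale


if __name__ == "__main__":
    mae_scale, mae_full, rel = compare()
    print(f"scale-only MAE:     {mae_scale:.4f}")
    print(f"all-features MAE:   {mae_full:.4f}")
    print(f"relative reduction: {rel:.1%}")
```

The relative reduction computed at the end is the kind of quantity the "relative 3-28%" figure refers to: the drop in held-out error when design features are added, expressed as a fraction of the scale-only error.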

Paper Structure

This paper contains 64 sections, 2 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: We document design decisions from open-weights models related to both architecture and data composition, and train predictors for downstream task performance. This allows us to examine the impact of model design choices individually.
  • Figure 2: Taxonomy of pretraining data categories. We sorted data sources into this taxonomy based on model documentation.
  • Figure 3: Performance of models plotted against their total parameters and tokens. The background colour represents the scaling law (\ref{eq:kaplan-scaling}) fitted to the task, and the marker colours indicate true performance. Some tasks have different performance trends with scale. Within each task, individual models may also perform unexpectedly.
  • Figure 4: In all tasks, the number of parameters and pretraining tokens heavily influences the predictions made by the regressor. The percentage of code in pretraining often influences predictions negatively for NLI tasks but positively for HumanEval. [D], [A] and [F] denote features derived from a model's data, architecture, or free generations, respectively.
  • Figure 5: SHAP impact of code percentage on Lambada (a representative NL task) and HumanEval in our regressors.
  • ...and 10 more figures