Scaling Laws for Sparsely-Connected Foundation Models

Elias Frantar; Carlos Riquelme; Neil Houlsby; Dan Alistarh; Utku Evci

TL;DR

This paper establishes a scaling framework for weight-sparse Transformers trained on massive datasets, introducing a joint law $L(S,N,D)$ that captures how sparsity $S$, non-zero parameter count $N$, and training data/steps $D$ jointly determine pretraining loss. The law combines a dense scaling baseline with a saturating sparsity factor: sparsity affects model capacity in a roughly multiplicative way across model sizes, while the data-scaling term stays largely the same as in the dense case. The authors validate the law on ViT/JFT-4B and T5/C4 and derive an explicit expression for the optimal sparsity $S_\text{opt}(N,C)$ under compute constraints. The optimal sparsity increases with longer training, and in many settings 50–75% sparsity yields notable gains (up to about 2x-equivalent dense capacity), while structured sparsity patterns (2:4, 4:8) behave similarly to unstructured sparsity. The paper also explores practical extensions, such as pruning pretrained models and N:M sparsity, and discusses when sparsity is a viable route to efficiency under real hardware and compute budgets.
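To make the shape of such a law concrete, the following is a minimal Python sketch of one functional form that matches the description above: a capacity term whose coefficient shrinks with sparsity and saturates, multiplied by a power law in $N$, plus a separate data term and a constant. The exact parametrization, the function names, and every coefficient value in `COEFFS` are assumptions made for illustration; they are not the paper's fitted values, and the paper itself should be consulted for the precise equation.

```python
def sparse_scaling_loss(S, N, D, a_S, b_S, c_S, b_N, a_D, b_D, c):
    """Assumed joint law L(S, N, D): a sparsity-dependent factor multiplies
    the parameter power law; the data term is the same as in the dense case."""
    capacity_term = (a_S * (1.0 - S) ** b_S + c_S) * (1.0 / N) ** b_N
    data_term = (a_D / D) ** b_D
    return capacity_term + data_term + c


def dense_equivalent_multiplier(S, a_S, b_S, c_S, b_N):
    """Factor k such that a dense model with k * N parameters matches the
    capacity term of a sparse model with N non-zero parameters at sparsity S."""
    dense_factor = a_S + c_S                      # sparsity factor at S = 0
    sparse_factor = a_S * (1.0 - S) ** b_S + c_S  # sparsity factor at S
    return (dense_factor / sparse_factor) ** (1.0 / b_N)


# Hypothetical coefficients, chosen only so the example runs.
COEFFS = dict(a_S=1.0, b_S=0.5, c_S=0.5, b_N=0.3, a_D=1e6, b_D=0.3, c=1.0)

if __name__ == "__main__":
    N, D = 1e8, 1e9  # non-zero parameters and training steps/tokens (illustrative)
    for S in (0.0, 0.5, 0.75, 0.875):
        loss = sparse_scaling_loss(S, N, D, **COEFFS)
        k = dense_equivalent_multiplier(
            S, COEFFS["a_S"], COEFFS["b_S"], COEFFS["c_S"], COEFFS["b_N"]
        )
        print(f"S={S:.3f}  loss={loss:.4f}  dense-equivalent multiplier={k:.2f}")
```

Under this assumed form, the multiplier is bounded above by `((a_S + c_S) / c_S) ** (1 / b_N)` as $S \to 1$, which is one way to read the "saturating" behavior: at a fixed non-zero parameter count, extra sparsity helps less and less.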

Abstract

We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales, on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity", the sparsity level which yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we identify that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity towards computational efficiency improvements.
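As an illustration of the n:m pattern mentioned above, here is a small pure-Python sketch that builds a mask over a flat list of weights. It assumes the common convention that at most n of every m consecutive weights are kept non-zero (for 2:4 and 4:8 this gives 50% sparsity), and it selects the kept weights by magnitude. Both the function name `nm_sparsity_mask` and the magnitude criterion are illustrative assumptions, not necessarily the pruning procedure used in the paper.

```python
def nm_sparsity_mask(weights, n=2, m=4):
    """Return a 0/1 mask that keeps the n largest-magnitude weights
    in every group of m consecutive weights (assumed n:m convention)."""
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")

    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the group sorted by descending magnitude; keep the top n.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def apply_mask(weights, mask):
    """Zero out the weights that the mask drops."""
    return [w * k for w, k in zip(weights, mask)]


if __name__ == "__main__":
    w = [0.1, -0.9, 0.3, 0.05, 2.0, -0.2, 0.0, 1.5]
    mask = nm_sparsity_mask(w, n=2, m=4)
    print(mask)
    print(apply_mask(w, mask))
```

Because every group of m weights has the same number of non-zeros, the pattern is regular enough for hardware to exploit, which is why it is described as hardware-friendly.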

Paper Structure (40 sections, 7 equations, 6 figures, 7 tables, 1 algorithm)

Introduction
Fair Evaluation in the Presence of Strong Scaling
Scaling Laws for Parameter-Sparse Transformers
Experimental Setup
Overview.
Sweep grids.
Deriving the Core Law
Dense scaling.
Preliminary observations.
Sparse scaling law.
T5/C4 results.
ViT/JFT-4B results.
Optimal Sparsity
Empirical results.
Limit Performance
...and 25 more sections

Figures (6)

Figure 1: (Left) Fit and extrapolation quality of the $L(S, N, D)$ scaling law on T5/C4. (Right) Optimal sparsity $S_\text{opt}$ contours fitted on ViT/JFT, for sparse and dense costs (details in the Optimal Sparsity section).
Figure 2: Visualization of T5/C4 sweep results for all sizes and sparsities, grouped by training steps.
Figure 3: Visual comparison of the ViT scaling sweep data and the corresponding fitted scaling law.
Figure 4: Optimal T5 sparsity contours.
Figure 5: Loss vs. sparse pretraining FLOPs for ViT models of varying sparsity.
...and 1 more figure
