Generalizing Scaling Laws for Dense and Sparse Large Language Models

Md Arafat Hossain, Xingfu Wu, Valerie Taylor, Ali Jannesari

TL;DR

The paper tackles predicting pretraining performance and resource needs for both dense and sparse large language models. It introduces a generalized scaling law that unifies dense Hoffmann-style and sparse Frantar/Abnar-style laws by incorporating sparsity into the active-parameter term and adding a sparsity-dependent offset, expressed as $L(N,D,S) = e (1-S)^{\gamma} + \left(a (1-S)^{\alpha} + c S\right) \frac{1}{N^{\alpha}} + \frac{b}{D^{\beta}}$. The authors demonstrate that the generalized law recovers the dense limit when $S=0$ and closely matches prior laws across pruning, MoE, and dense regimes on existing datasets, with IsoFLOP-style validations extended to MoE models such as DeepSeek-V3. They also show that Bayesian autotuning (e.g., ytopt) improves coefficient estimation, enabling practical hyperparameter optimization for budget planning.
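
As a minimal sketch of the formula above, the following Python evaluates $L(N,D,S)$ and checks numerically that it collapses to the dense form $e + a/N^{\alpha} + b/D^{\beta}$ at $S=0$. The function names and every coefficient value in `COEFFS` are illustrative assumptions, not the paper's fitted values. The sketch also assumes $0 \le S < 1$.

```python
def generalized_loss(N, D, S, *, a, b, c, e, alpha, beta, gamma):
    """Evaluate L(N, D, S) = e(1-S)^gamma + (a(1-S)^alpha + cS)/N^alpha + b/D^beta."""
    if not 0.0 <= S < 1.0:
        raise ValueError("sparsity S is assumed to lie in [0, 1)")
    offset = e * (1.0 - S) ** gamma                             # sparsity-dependent offset
    param_term = (a * (1.0 - S) ** alpha + c * S) / N ** alpha  # parameter term
    data_term = b / D ** beta                                   # data term
    return offset + param_term + data_term


def dense_loss(N, D, *, a, b, e, alpha, beta):
    """Dense form e + a/N^alpha + b/D^beta, i.e. the S = 0 limit."""
    return e + a / N ** alpha + b / D ** beta


# Hypothetical placeholder coefficients, chosen only to make the example run.
COEFFS = dict(a=400.0, b=400.0, c=2000.0, e=1.7,
              alpha=0.34, beta=0.28, gamma=0.05)

if __name__ == "__main__":
    N, D = 1e9, 2e10  # hypothetical parameter count and training-token count
    dense = dense_loss(N, D, a=COEFFS["a"], b=COEFFS["b"], e=COEFFS["e"],
                       alpha=COEFFS["alpha"], beta=COEFFS["beta"])
    print("dense form:        ", dense)
    print("generalized, S=0:  ", generalized_loss(N, D, 0.0, **COEFFS))
    print("generalized, S=0.5:", generalized_loss(N, D, 0.5, **COEFFS))
```

At $S=0$ the factor $(1-S)^{\gamma}$ equals 1 and the $cS$ term vanishes, so the first two printed values agree regardless of the values chosen for $c$ and $\gamma$.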

Abstract

Despite recent advances in large language models (LLMs), predicting the optimal model size for LLM pretraining and allocating resources optimally remain a challenge. Several efforts have addressed this challenge by proposing empirical scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing empirical scaling laws and propose a generalized scaling law that provides a unified framework applicable to both dense and sparse large language models. We evaluate our proposed scaling law against existing scaling laws and demonstrate that it captures their scaling behavior. Further, we show an IsoFLOP comparison between our proposed scaling law and the state-of-the-art scaling law to illustrate its effectiveness for Mixture-of-Experts (MoE)-based very large LLMs like DeepSeek-V3. Our proposed scaling law can be used to estimate the best model hyperparameters (model size, tokens, and compute) for a given sparsity, or to identify the optimal sparsity for given model hyperparameters.
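
The last use case in the abstract, identifying the optimal sparsity for given model hyperparameters, can be sketched as a one-dimensional search over $S$ with $N$ and $D$ held fixed. The sketch below uses a plain grid search over the functional form from the TL;DR. The helper name `best_sparsity`, the grid resolution, and all coefficient values are assumptions made for illustration. They do not reproduce the authors' procedure or fitted coefficients.

```python
def generalized_loss(N, D, S, *, a, b, c, e, alpha, beta, gamma):
    """L(N, D, S) = e(1-S)^gamma + (a(1-S)^alpha + cS)/N^alpha + b/D^beta."""
    return (e * (1.0 - S) ** gamma
            + (a * (1.0 - S) ** alpha + c * S) / N ** alpha
            + b / D ** beta)


def best_sparsity(N, D, coeffs, steps=100):
    """Return (S, loss) minimizing the predicted loss over a uniform grid in [0, 1).

    `steps` grid points are used: 0, 1/steps, ..., (steps - 1)/steps.
    """
    grid = [i / steps for i in range(steps)]
    best = min(grid, key=lambda S: generalized_loss(N, D, S, **coeffs))
    return best, generalized_loss(N, D, best, **coeffs)


# Hypothetical placeholder coefficients, not values fitted in the paper.
COEFFS = dict(a=400.0, b=400.0, c=2000.0, e=1.7,
              alpha=0.34, beta=0.28, gamma=0.05)

if __name__ == "__main__":
    N, D = 1e8, 1e10  # hypothetical parameter count and training-token count
    S_star, loss_star = best_sparsity(N, D, COEFFS)
    print(f"grid-optimal sparsity: {S_star:.2f}, predicted loss: {loss_star:.4f}")
```

A grid search is used here only because it is easy to verify. With fitted coefficients, the same objective could be passed to any scalar minimizer, and the location of the optimum depends entirely on those coefficients.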

Paper Structure

This paper contains 12 sections, 10 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison between Hoffmann scaling law and Frantar scaling law at $0\%$ sparsity.
  • Figure 2: Comparison between Hoffmann scaling law and Abnar scaling law at $0\%$ sparsity.
  • Figure 3: Loss prediction of Hoffmann scaling law and the proposed scaling law.
  • Figure 4: As sparsity and the number of non-zero/active parameters increase, the pretraining loss decreases. In each subfigure, the number of non-zero parameters in the models ranges from 1.3M to 85M across varying sparsity levels, and the number of training tokens is given in the subcaption.
  • Figure 5: Scaling behavior and prediction comparison between Abnar scaling law and our proposed scaling law.
  • ...and 3 more figures