A Resource Model For Neural Scaling Law

Jinyeop Song, Ziming Liu, Max Tegmark, Jeff Gore

TL;DR

The paper approaches neural scaling laws through a resource-based view in which a composite task is decomposed into subtasks that compete for neuron resources. Toy experiments show that the loss of a single subtask scales as $\ell \propto N^{-1}$ in its number of allocated neurons $N$, and that when several subtasks are present their allocations grow homogeneously, so the ratios between them stay constant as the model grows. Together these findings give a general scaling relation and, under width-depth scaling, $\ell \propto N_p^{-1/3}$ in the parameter count $N_p$, consistent with the Chinchilla results. The framework is extended to parallel and series compositions: homogeneous growth of neuron redundancies and linear additivity of subtasks yield $\ell \propto N^{-1}$ for composite tasks, which suggests the relation may hold for general composite tasks. The approach offers a simple, actionable lens for diagnosing and guiding neural network scaling, with implications for predicting LLM performance and for understanding modularity in deep networks.
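
The step from $\ell \propto N^{-1}$ to $\ell \propto N_p^{-1/3}$ can be sketched as follows. This is a reading of "width-depth scaling" under assumptions that the summary above does not spell out: $N$ is taken to be the layer width, $L$ (a symbol introduced here) is the depth, and the parameter count is taken to be dominated by dense layer weights.

```latex
% Assumptions (not stated in the summary above): N is the layer width,
% L is the depth, and dense layers dominate the parameter count N_p.
\begin{align}
  N_p &\sim N^2 L
      && \text{(roughly } N \times N \text{ weights per layer)} \\
  L &\propto N
      && \text{(width and depth scaled together)} \\
  \Rightarrow\quad N_p &\propto N^3,
      \qquad N \propto N_p^{1/3} \\
  \ell \propto N^{-1} \quad &\Rightarrow\quad \ell \propto N_p^{-1/3}
\end{align}
```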

Abstract

Neural scaling laws characterize how model performance improves as the model size scales up. Inspired by empirical observations, we introduce a resource model of neural scaling. A task is usually composite and hence can be decomposed into many subtasks, which compete for resources (measured by the number of neurons allocated to subtasks). On toy problems, we empirically find that: (1) The loss of a subtask is inversely proportional to its allocated neurons. (2) When multiple subtasks are present in a composite task, the resources acquired by each subtask grow uniformly as models get larger, keeping the ratios of acquired resources constant. We hypothesize these findings to be generally true and build a model to predict neural scaling laws for general composite tasks, which successfully replicates the neural scaling law of Chinchilla models reported in arXiv:2203.15556. We believe that the notion of resource used in this paper will be a useful tool for characterizing and diagnosing neural networks.
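
The two empirical findings combine into an $N^{-1}$ law for a composite task, which a few lines of Python can illustrate. This is a minimal sketch, not the authors' code: the function name `composite_loss`, the constants in `coeffs`, the allocation `ratios`, and the loss `weights` are all hypothetical values chosen for illustration.

```python
# Illustrative sketch only: every number below is made up, not taken from the paper.

def composite_loss(total_neurons, ratios, coeffs, weights):
    """Weighted sum of subtask losses under the two findings in the abstract.

    Finding (1): the loss of subtask i is c_i / N_i (inverse in allocated neurons).
    Finding (2): N_i = r_i * N, with the ratios r_i fixed as the model grows.
    """
    loss = 0.0
    for r, c, a in zip(ratios, coeffs, weights):
        n_i = r * total_neurons      # neurons allocated to this subtask
        loss += a * c / n_i          # weighted 1/N_i contribution
    return loss

ratios = [0.5, 0.3, 0.2]    # hypothetical constant allocation ratios
coeffs = [1.0, 2.0, 0.5]    # hypothetical per-subtask constants c_i
weights = [1.0, 1.0, 1.0]   # hypothetical loss weights

# Doubling the total number of neurons halves the composite loss.
for n in (100, 200, 400):
    print(n, composite_loss(n, ratios, coeffs, weights))
```

Because every term carries the same factor $1/N$ once the ratios are fixed, the composite loss inherits the single-subtask exponent regardless of the particular constants.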

Paper Structure (18 sections, 2 theorems, 17 equations, 8 figures)

Key Result

Theorem 1

Consider a set of target functions $f_i: x \rightarrow f_i(x)$ for $i=1,2,\ldots,Q$, with corresponding numbers of allocated neurons $N_i$. Consider a composite task that is a linear, parallel combination of these regression tasks, i.e. the composite task loss is $\ell = \sum_{i=1}^{Q}\alpha_i {\rm MSE}_i$ ...

Figures (8)

  • Figure 1: Overview of the resource model. (Top left) Neurons in a neural network play the role of resources, while (sub)tasks are consumers competing for these resources. (Top right) An example of a composite task consisting of subtasks combined in parallel and in series. (Bottom) The homogeneous growth hypothesis: when the network grows wider, each task will acquire more resources (allocated neurons), while the ratios of their acquired resources are kept constant.
  • Figure 2: (A) Toy experiment: a single $x^2$ regression task. (B) (top) Weight and bias maps for fixed seeds are plotted for four representative values of $\alpha=3.5, 9.0, 19.3, 128$. Dots with nonzero weights are classified as allocated to the $x^2$ module and colored red. (bottom) The number of allocated neurons $N$ and the task loss $\ell$ are plotted against $\alpha$. (C) The $N^{-1}$ scaling between the number of allocated neurons and the task loss.
  • Figure 3: (A) A neural network with one hidden layer is tasked to perform two independent squared regression tasks in parallel. We vary the hyperparameters $\alpha$, the overall task intensity, and $\beta$, the relative weight of the two tasks. Neurons allocated to the first and second tasks are colored red and blue, respectively. (B) The resources of the first and second tasks increase together as $\alpha$ increases. Surprisingly, the ratio of allocated neurons stays constant across a range of $\alpha$, which serves as the basis for the hypothesis of homogeneous growth of resource allocation. (C) The ratios of MSE and the ratios of allocated neurons are insensitive to $\alpha$ but depend heavily on $\beta$. (D) We observe the emergence of $N^{-1}$ scaling for the composite task of two tasks combined in parallel.
  • Figure 4: Composition of tasks in series. (A) A neural network with four hidden layers is tasked to perform regression of the composite function of $f$ and $g$. We vary the hyperparameter $\alpha$ for overall task intensity. (B) (top) Weight structures of the trained models are visualized for $\alpha=80, 1280$. (bottom) Neurons in the third layer of both models exhibit high correlation with $g(x)$, shown in green for positive and purple for negative correlation. We found evidence that the first three layers perform $g(x)$ and the last layer performs $f(x)$. (C) We observe the $N^{-1}$ scaling for the composite task loss of two tasks combined in series.
  • Figure 5: Scaling relationship between the number of allocated neurons and task loss across various regression tasks. Simulations were conducted for three different network depths, with hidden layer configurations of [1000], [200, 200], and [150, 150, 150], which are represented by the colors red, yellow, and blue, respectively.
  • ...and 3 more figures
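
The $N^{-1}$ scaling reported in the captions above is the kind of relation that is usually read off as the slope of a log-log plot. The sketch below shows one way such an exponent could be estimated; it is not the authors' analysis code, the helper name `fit_power_law_exponent` is hypothetical, and the data are synthetic values generated to follow $N^{-1}$ exactly rather than measurements from the paper.

```python
import math

def fit_power_law_exponent(ns, losses):
    """Least-squares slope of log(loss) against log(N).

    For loss = c * N**k, the returned slope estimates the exponent k.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic data (NOT from the paper): loss = 3 / N, i.e. an exact N^-1 law.
ns = [10, 20, 40, 80, 160]
losses = [3.0 / n for n in ns]

slope = fit_power_law_exponent(ns, losses)
print(f"fitted exponent: {slope:.3f}")
```

On real measurements the fitted slope would only approximate $-1$, and the fit range matters, since a power law typically holds over a limited span of $N$.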

Theorems & Definitions (2)

  • Theorem 1: Scaling of Tasks Combined in Parallel
  • Theorem 2: Scaling of Tasks Combined in Series