An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models

Anuj K. Nayak, Lav R. Varshney

TL;DR

A simple unified mathematical framework is presented to explain three language model scaling phenomena: compute-optimal size scaling, emergent capabilities, and performance plateauing. It builds on recent skill-text bipartite graph frameworks for semantic learning and gives a simple explanation for both the emergence of complex skills and the plateauing of performance as language model size scales.

Abstract

Recent empirical studies show three phenomena with increasing size of language models: compute-optimal size scaling, emergent capabilities, and performance plateauing. We present a simple unified mathematical framework to explain all of these language model scaling phenomena, building on recent skill-text bipartite graph frameworks for semantic learning. Modeling the learning of concepts from texts as an iterative process yields an analogy to iterative decoding of low-density parity check (LDPC) codes in information theory. Thence, drawing on finite-size scaling characterizations of LDPC decoding, we derive the compute-optimal size scaling (Chinchilla rule) for language models. Further, using tools from random network theory, we provide a simple explanation for both the emergence of complex skills and the plateauing of performance as the size of language models scales. We see multiple plateaus.
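
The LDPC analogy can be illustrated with a toy peeling decoder, the standard iterative erasure decoder run on a bipartite (Tanner) graph. The sketch below is an illustration under stated assumptions, not the authors' model: it assumes a text teaches a concept exactly when that concept is the only one in the text not yet learnt, and it starts from a hypothetical seed set of known concepts. The function name `peel` and the example graph are made up for this sketch.

```python
from collections import deque

def peel(texts, known):
    """Toy peeling decoder on a concept-text bipartite graph.

    texts: list of sets of concept ids (each set is one text / check node).
    known: set of concept ids assumed already learnt (hypothetical seed).
    Returns the set of concepts learnt once no text can teach anything new.
    """
    known = set(known)
    # Map each concept to the texts that mention it.
    concept_to_texts = {}
    for t, concepts in enumerate(texts):
        for c in concepts:
            concept_to_texts.setdefault(c, []).append(t)
    # Concepts still unknown in each text.
    unknown = [set(concepts) - known for concepts in texts]
    # A text with exactly one unknown concept can teach that concept.
    queue = deque(t for t, u in enumerate(unknown) if len(u) == 1)
    while queue:
        t = queue.popleft()
        if len(unknown[t]) != 1:
            continue  # this text was already resolved by an earlier step
        (c,) = unknown[t]
        known.add(c)
        # Learning c may unlock other texts that mention it.
        for t2 in concept_to_texts[c]:
            unknown[t2].discard(c)
            if len(unknown[t2]) == 1:
                queue.append(t2)
    return known

# Hypothetical example: a chain of texts plus one text that stays stuck.
texts = [{0, 1}, {1, 2}, {2, 3}, {4, 5}]
learnt = peel(texts, known={0})
```

In this example the chain of texts lets concepts 1, 2, and 3 be learnt one after another, while the text `{4, 5}` never has a single unknown concept and so teaches nothing. Such stuck subgraphs are what limit iterative decoding at finite size.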

Paper Structure

This paper contains 18 sections, 1 theorem, 27 equations, 5 figures.

Key Result

Proposition 1

Compute-optimal scaling rule: For compute-optimal performance of a language model, the dataset size ($D$) and model size ($N$) must scale equally with the increasing compute budget $C$ (or FLOPs).
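
To see what equal scaling implies numerically, the sketch below splits a compute budget into a model size and a dataset size. It relies on two assumptions that are not stated in the proposition: the common approximation $C \approx 6ND$ for training FLOPs, and a fixed ratio $D/N$ (set here to 20 tokens per parameter, a frequently quoted heuristic). Under those assumptions both $N$ and $D$ grow as $\sqrt{C}$. The function name `compute_optimal_split` is hypothetical.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0,
                          flops_per_param_token=6.0):
    """Split a FLOP budget C into model size N and dataset size D.

    Assumes C ~= k * N * D with k = 6 and a fixed ratio D / N = r.
    Both k and r are illustrative assumptions, not values from the source.
    """
    # From C = k * N * (r * N) it follows that N = sqrt(C / (k * r)).
    n = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    d = tokens_per_param * n
    return n, d

# Budget quoted in the Figure 3 caption, and a budget four times larger.
n1, d1 = compute_optimal_split(5.76e23)
n2, d2 = compute_optimal_split(4 * 5.76e23)
```

With these assumed constants, quadrupling the budget doubles both `n` and `d`, which is the "scale equally" behaviour the proposition describes. For $C = 5.76 \times 10^{23}$ the split comes out at roughly $6.9 \times 10^{10}$ parameters and $1.4 \times 10^{12}$ tokens.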

Figures (5)

  • Figure 1: A unified framework of learning concepts and skills by language models. The lower subgraph $G^{(C)}_1$ is a concept-text bipartite graph akin to a Tanner graph representation of an LDPC code. The upper subgraph $G_2$ shows concept-skill and skill-to-skill relationships, with multiple levels of skills denoted by $l$. Higher $l$ indicates more advanced skills.
  • Figure 2: IsoFLOP curves: (left) Number of concepts learnt as a function of $R$ for different compute budgets (FLOPs); (right) Block erasure threshold as a function of the number of concepts $R$ for different compute budgets. In both subfigures, solid black markers indicate the points corresponding to $R^*$.
  • Figure 3: (a) Model and dataset size pair $(N^*, D^*)$ that maximizes Eq. \ref{eqn:maximize_skills_learnt} as a function of compute budget $C$. The curves being parallel in logarithmic scale indicates that model size and dataset size must scale equally with $C$. In this subplot, we set $\varsigma = 2 \times 10^5$, $\tau = 8 \times 10^5$, and $d_t = 6$. The markers indicate $(N, D)$ corresponding to the compute-optimal performance predicted by the Chinchilla rule (Hoffmann et al., 2022) when the compute budget is $5.76 \times 10^{23}$ (dashed vertical line); (b) Scaling of the lower bound of excess entropy in Eq. \ref{eqn:excess_entropy_LB} compared with the empirically observed scaling according to Hoffmann et al. (2022) as a function of the model size $N^*$.
  • Figure 4: Accuracy of the language model sharply increases after the model size (equivalently $C$) exceeds a threshold, which is a consequence of the emergence of a GCC in a skill graph $G_2^{(l)}$. (a) Step increase in accuracy for a homogeneous task. (b) Skill level distribution $q(l)$ for unimodal and multimodal heterogeneous tasks. (c) Smooth emergence for a unimodal heterogeneous task. (d) Plateauing phenomena as a consequence of a task requiring diverse skills according to a multimodal distribution. In this subplot, we used the following values for the parameters: number of skill levels $L = 100$, $S^{(l)} = 10^3$, $\eta_l = \exp(7l/L)$, $\sigma_l = \log_2(l)$ for all $l \in \{1,\ldots,L\}$, $q(m) = 1/6$ for all $m \in \{2,\ldots,7\}$.
  • Figure 5: Bipartite graph $\widetilde{G}_1$.
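
The threshold behaviour described in the Figure 4 caption can be reproduced in a generic random graph. The sketch below assumes GCC stands for giant connected component and uses a plain sparse random graph with uniformly drawn edges as a stand-in; the paper's skill graph $G_2^{(l)}$ may be constructed differently. It measures the fraction of nodes in the largest component for a chosen mean degree, using a union-find structure. The function name and parameter values are hypothetical.

```python
import random

def largest_component_fraction(n, mean_degree, seed=0):
    """Fraction of nodes in the largest component of a sparse random graph.

    Draws about n * mean_degree / 2 uniformly random edges (a G(n, m)-style
    stand-in; the paper's skill graph may be built differently).
    """
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Path-halving find for the union-find structure.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(int(n * mean_degree / 2)):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:
            # Union by size: attach the smaller tree under the larger one.
            if size[a] < size[b]:
                a, b = b, a
            parent[b] = a
            size[a] += size[b]
    return max(size[find(x)] for x in range(n)) / n

below = largest_component_fraction(5000, 0.5)  # mean degree below 1
above = largest_component_fraction(5000, 3.0)  # mean degree above 1
```

For a mean degree below 1 the largest component covers only a tiny fraction of nodes, while above 1 it spans most of the graph. This kind of sharp transition is the standard random-graph mechanism behind a step-like jump once a size threshold is crossed.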

Theorems & Definitions (2)

  • Proposition 1
  • proof