An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models

Anuj K. Nayak; Lav R. Varshney

TL;DR

A simple unified mathematical framework explains three language model scaling phenomena: compute-optimal size scaling, emergent capabilities, and performance plateauing. It builds on recent skill-text bipartite graph frameworks for semantic learning and gives a simple explanation for both the emergence of complex skills and the plateauing of performance as language model size scales.

Abstract

Recent empirical studies show three phenomena with increasing size of language models: compute-optimal size scaling, emergent capabilities, and performance plateauing. We present a simple unified mathematical framework to explain all of these language model scaling phenomena, building on recent skill-text bipartite graph frameworks for semantic learning. Modeling the learning of concepts from texts as an iterative process yields an analogy to iterative decoding of low-density parity check (LDPC) codes in information theory. Thence, drawing on finite-size scaling characterizations of LDPC decoding, we derive the compute-optimal size scaling (Chinchilla rule) for language models. Further, using tools from random network theory, we provide a simple explanation for both the emergence of complex skills and the plateauing of performance as the size of language models scales. We see multiple plateaus.
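
The iterative-learning analogy in the abstract can be illustrated with a peeling process on a bipartite graph, which is the usual way to describe belief propagation decoding over an erasure channel. The following is a minimal sketch, not the authors' model: the function name `peel`, the toy graph, and the rule that a text piece reveals a concept once all of its other concepts are known are illustrative assumptions.

```python
def peel(texts, known):
    """Iteratively learn concepts from text pieces (hypothetical toy rule).

    texts: list of sets; each set holds the concepts one text piece touches
    known: concepts already learnt before the process starts
    Assumed rule: a text with exactly one unknown concept reveals it,
    mirroring a check node that resolves its single erased neighbour.
    """
    known = set(known)
    progress = True
    while progress:
        progress = False
        for concepts in texts:
            unknown = concepts - known
            if len(unknown) == 1:
                known |= unknown  # the single missing concept is learnt
                progress = True
    return known


# Toy concept-text graph: the last text has two unknown concepts, so peeling stalls there.
texts = [{"a", "b"}, {"b", "c"}, {"c", "d", "e"}, {"e", "f", "g"}]
learnt = peel(texts, known={"a", "d"})
print(sorted(learnt))
```

In this toy run the process learns `b`, `c`, and `e` in turn and then stops, because the remaining text still has two unknown concepts; this stalling behaviour is what makes the fraction of initially known concepts act like a decoding threshold.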

Paper Structure (18 sections, 1 theorem, 27 equations, 5 figures)

This paper contains 18 sections, 1 theorem, 27 equations, 5 figures.

Introduction
Graph-based framework
Texts, concepts, and skills
Notation
Learning concepts from text pieces
Acquisition of skills and composition of skills
Defining emergence
Explaining all three phenomena
Compute-optimal scaling rule
Scaling of excess entropy
Emergence
Plateauing
Conclusion
Solving \ref{eqn:maximize_skills_learnt}: Maximizing concept learning under a compute budget constraint
A brief summary of belief propagation decoding of LDPC codes under erasure
...and 3 more sections

Key Result

Proposition 1

Compute-optimal scaling rule: For compute-optimal performance of a language model, the dataset size ($D$) and model size ($N$) must scale equally with the increasing compute budget $C$ (or FLOPs).
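
One way to read Proposition 1 numerically: if the compute budget is approximated as $C \approx 6ND$ (a common rule of thumb that is an assumption here, not something stated in this summary), then equal scaling means both $N$ and $D$ grow as $\sqrt{C}$ and their ratio stays fixed. The helper `equal_scaling` and the ratio of 20 tokens per parameter below are illustrative choices only.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumed approximation: C ~ 6 * N * D


def equal_scaling(compute, tokens_per_param=20.0):
    """Split a compute budget so that D / N stays at a fixed (assumed) ratio."""
    n = math.sqrt(compute / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    d = tokens_per_param * n
    return n, d


# Each 10x increase in compute multiplies both N and D by sqrt(10).
for c in (1e21, 1e22, 1e23):
    n, d = equal_scaling(c)
    print(f"C={c:.0e}  N={n:.3e}  D={d:.3e}  D/N={d / n:.1f}")
```

On a log-log plot, the two resulting curves for $N$ and $D$ against $C$ are parallel lines with slope 1/2, which is the visual signature of equal scaling.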

Figures (5)

Figure 1: A unified framework of learning concepts and skills by language models. The lower subgraph $G^{(C)}_1$ is a concept-text bipartite graph akin to a Tanner graph representation of an LDPC code. The upper subgraph $G_2$ shows concept-skill and skill-to-skill relationships, with multiple levels of skills denoted by $l$. Higher $l$ indicates more advanced skills.
Figure 2: IsoFLOP curves: (left) Number of concepts learnt as a function of $R$ for different compute budgets (FLOPs); (right) Block erasure threshold as a function of the number of concepts $R$ for different compute budgets. In both subfigures, solid black markers indicate the points corresponding to $R^*$.
Figure 3: (a) Model and dataset size pair $(N^*, D^*)$ that maximizes \ref{eqn:maximize_skills_learnt} as a function of compute budget $C$. The curves being parallel in logarithmic scale indicates that model size and dataset size must scale equally with $C$. In this subplot, we set $\varsigma = 2 \times 10^5$, $\tau = 8 \times 10^5$, and $d_t = 6$. The markers indicate $(N, D)$ corresponding to the compute-optimal performance predicted by the Chinchilla rule (Hoffmann et al., 2022) when the compute budget is $5.76 \times 10^{23}$ (dashed vertical line); (b) Scaling of the lower bound of excess entropy in \ref{eqn:excess_entropy_LB} compared with the empirically observed scaling according to Hoffmann et al. (2022) as a function of the model size $N^*$.
Figure 4: Accuracy of the language model sharply increases after the model size (equivalently $C$) exceeds a threshold, which is a consequence of the emergence of a GCC in a skill graph $G_2^{(l)}$. (a) Step increase in accuracy for a homogeneous task. (b) Skill level distribution $q(l)$ for unimodal and multimodal heterogeneous tasks. (c) Smooth emergence for unimodal heterogeneous task. (d) Plateauing phenomena as a consequence of a task requiring diverse skills according to multimodal distribution. In this subplot, we used the following values for the parameters: number of skill levels $L = 100$, $S^{(l)} = 10^3$, $\eta_l = \exp(7l/L)$, $\sigma_l = \log_2(l)$ for all $l \in \{1,\ldots,L\}$, $q(m) = 1/6$ for all $m \in \{2,\ldots,7\}$.
Figure 5: Bipartite graph $\widetilde{G}_1$.
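
The sharp accuracy jump described in the Figure 4 caption is attributed to the emergence of a GCC in a skill graph. As a hedged illustration of that kind of threshold, the sketch below uses the classical Erdős–Rényi random graph, where the fraction $s$ of nodes in the giant component satisfies $s = 1 - e^{-cs}$ for mean degree $c$. Treating the skill graph as Erdős–Rényi is an assumption made only for this example; the function name `giant_fraction` is likewise hypothetical.

```python
import math


def giant_fraction(mean_degree, iterations=1000):
    """Fixed-point iteration for s = 1 - exp(-c * s), started from s = 1.

    Assumes an Erdos-Renyi random graph; returns the limiting fraction of
    nodes that belong to the giant connected component.
    """
    s = 1.0
    for _ in range(iterations):
        s = -math.expm1(-mean_degree * s)  # equals 1 - exp(-c * s)
    return s


# Below mean degree 1 the fraction collapses to zero; above it, it rises quickly.
for c in (0.5, 0.9, 1.5, 2.0, 3.0):
    print(f"mean degree {c:.1f}: giant component fraction {giant_fraction(c):.4f}")
```

The abrupt change around mean degree 1 is a simple analogue of a step-like rise, while mixing several such thresholds at different levels is one way smoother curves or plateaus could arise.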

Theorems & Definitions (2)

Proposition 1
proof
