An exactly solvable model for emergence and scaling laws in the multitask sparse parity problem
Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Chris Mingard, Ard A. Louis
TL;DR
This work formulates an analytically tractable framework for emergence and scaling in multitask learning by representing skills as an orthogonal basis of functions and solving a multilinear model in that basis. It derives exact scaling laws for the loss with respect to training time $T$, data size $D$, number of parameters $N$, and compute $C=N\times T$, and shows stage-like, sigmoidal skill emergence consistent with observations in neural networks. Calibrating the model on the first skill is enough to predict the emergence of subsequent skills in a two-layer MLP and a transformer, linking feature learning to emergent capabilities. The study further extends the model to account for data-shot learning ($D_c$-shot) and parameter-shot learning ($N_c$-shot), highlighting tradeoffs between data, compute, and representation capacity while acknowledging limitations due to decoupled dynamics. Overall, the results suggest that stage-like learning across a power-law distribution of skill frequencies can reproduce key qualitative and quantitative aspects of emergence in neural networks.
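The sigmoidal, stage-like emergence described above can be illustrated with a minimal sketch. It is not the paper's exact formulation. It assumes each skill $k$ has strength $R_k = a_k b_k$, that skills are decoupled, and that the loss is $\frac{1}{2}\sum_k p_k (S - a_k b_k)^2$ with skill frequency $p_k$ and target strength $S$. The names `simulate_skill_emergence`, `logistic_prediction`, `S`, `r0`, and `lr` are hypothetical. Under these assumptions and a balanced initialization $a_k = b_k$, gradient flow gives a logistic curve for each $R_k$, so more frequent skills emerge earlier.

```python
import numpy as np


def simulate_skill_emergence(probs, S=1.0, r0=1e-4, lr=0.01, n_steps=20000):
    """Gradient descent on an assumed decoupled loss 0.5 * sum_k p_k (S - a_k b_k)^2.

    Returns an array of shape (n_steps + 1, n_skills) holding the skill
    strengths R_k = a_k * b_k after each step.
    """
    probs = np.asarray(probs, dtype=float)
    a = np.full_like(probs, np.sqrt(r0))  # balanced init, so R_k(0) = r0
    b = a.copy()
    history = np.empty((n_steps + 1, probs.size))
    history[0] = a * b
    for t in range(1, n_steps + 1):
        err = S - a * b
        grad_a = -probs * b * err  # dL/da_k
        grad_b = -probs * a * err  # dL/db_k
        a, b = a - lr * grad_a, b - lr * grad_b
        history[t] = a * b
    return history


def logistic_prediction(probs, steps, S=1.0, r0=1e-4, lr=0.01):
    """Closed-form gradient-flow solution of the same assumed loss.

    With a_k = b_k, dR_k/dt = 2 * lr * p_k * R_k * (S - R_k), which is a
    logistic equation with rate 2 * lr * p_k * S.
    """
    probs = np.asarray(probs, dtype=float)
    t = np.asarray(steps, dtype=float)[:, None]
    return S / (1.0 + (S / r0 - 1.0) * np.exp(-2.0 * lr * probs * S * t))


if __name__ == "__main__":
    probs = np.array([0.6, 0.25, 0.15])  # illustrative skill frequencies
    hist = simulate_skill_emergence(probs, n_steps=8000)
    half_steps = (hist >= 0.5).argmax(axis=0)  # first step with R_k >= S / 2
    print("steps to reach half strength:", half_steps)
```

In this sketch the time for a skill to reach half of its target strength scales as $1/p_k$, which is why a power-law distribution of skill frequencies produces a sequence of well-separated emergence events.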
Abstract
Deep learning models can exhibit what appears to be a sudden ability to solve a new problem as training time, training data, or model size increases, a phenomenon known as emergence. In this paper, we present a framework where each new ability (a skill) is represented as a basis function. We solve a simple multilinear model in this skill basis, finding analytic expressions for the emergence of new skills, as well as for scaling laws of the loss with training time, data size, model size, and optimal compute. We compare our detailed calculations to direct simulations of a two-layer neural network trained on multitask sparse parity, where the tasks in the dataset are distributed according to a power law. Our simple model captures, using a single fit parameter, the sigmoidal emergence of multiple new skills as training time, data size, or model size increases in the neural network.
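For concreteness, a multitask sparse parity dataset with power-law task frequencies could be generated roughly as follows. This is a sketch under stated assumptions, not the authors' code. It assumes each input is a one-hot task indicator concatenated with random bits, that each task owns a fixed random subset of `m` bits, that the label is the parity of those bits, and that task $k$ is drawn with probability proportional to $k^{-(\alpha+1)}$. The function name, its parameters, and the default values are hypothetical.

```python
import numpy as np


def make_multitask_sparse_parity(n_samples, n_tasks=5, n_bits=32, m=3,
                                 alpha=0.6, seed=0):
    """Sample a multitask sparse parity dataset (illustrative assumptions).

    Returns (x, y, tasks, subsets, probs), where x has shape
    (n_samples, n_tasks + n_bits) and y holds parity labels in {0, 1}.
    """
    rng = np.random.default_rng(seed)

    # Assumed power-law task frequencies: p_k proportional to k^-(alpha + 1).
    ranks = np.arange(1, n_tasks + 1, dtype=float)
    probs = ranks ** -(alpha + 1.0)
    probs /= probs.sum()

    # Each task gets a fixed random subset of m relevant bit positions.
    subsets = np.stack(
        [rng.choice(n_bits, size=m, replace=False) for _ in range(n_tasks)]
    )

    tasks = rng.choice(n_tasks, size=n_samples, p=probs)
    bits = rng.integers(0, 2, size=(n_samples, n_bits))
    control = np.eye(n_tasks, dtype=np.int64)[tasks]  # one-hot task indicator

    # Label: parity (sum mod 2) of the bits relevant to the sampled task.
    relevant = np.take_along_axis(bits, subsets[tasks], axis=1)
    y = relevant.sum(axis=1) % 2

    x = np.concatenate([control, bits], axis=1)
    return x, y, tasks, subsets, probs


if __name__ == "__main__":
    x, y, tasks, subsets, probs = make_multitask_sparse_parity(10_000)
    print("input shape:", x.shape)
    print("empirical task frequencies:", np.bincount(tasks, minlength=5) / len(tasks))
```

Because rare tasks appear in only a small fraction of samples, a model trained on such data sees each skill at a different effective rate, which is the setting in which staggered emergence is studied.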
