gzip Predicts Data-dependent Scaling Laws
Rohan Pandey
TL;DR
The paper challenges the assumption that neural language-model scaling laws are data-agnostic by showing that data complexity, proxied by gzip compressibility, meaningfully alters the trade-off between training tokens and parameters. Using PCFG-generated datasets with controlled syntactic complexity, it demonstrates that fitted scaling-law parameters vary across datasets and that harder-to-compress data shifts the compute-optimal allocation toward more data rather than more parameters. It further derives a data-dependent scaling law and controls for confounders, with implications for code versus natural language and potential practical uses in data curation and curriculum design. The work points to a promising information-theoretic direction for understanding and optimizing compute in language modeling, while noting that its setting is synthetic and that broader validation is needed.
Abstract
Past work has established scaling laws that predict the performance of a neural language model (LM) as a function of its parameter count and the number of tokens it is trained on, enabling optimal allocation of a fixed compute budget. Are these scaling laws agnostic to training data, as some prior work suggests? We generate training datasets of varying complexity by modulating the syntactic properties of a probabilistic context-free grammar (PCFG), finding that 1) scaling laws are sensitive to differences in data complexity and that 2) gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties. We propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility; its compute-optimal frontier increasingly favors dataset size over parameter count as training data becomes harder to compress.
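To make the notion of gzip-compressibility concrete, here is a minimal sketch that measures it as the ratio of compressed size to raw size in bytes. The function name `gzip_compressibility`, the use of UTF-8 text rather than token sequences, and the two toy corpora are assumptions made for illustration; the abstract does not specify the exact measurement procedure used in the paper.

```python
import gzip
import random


def gzip_compressibility(text: str) -> float:
    """Return compressed size / raw size in bytes (hypothetical helper).

    Lower values mean the text is easier to compress; values closer to 1
    mean gzip finds little redundancy to exploit.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot measure compressibility of empty text")
    return len(gzip.compress(raw)) / len(raw)


# Two toy corpora (assumed for illustration, not taken from the paper):
# one highly repetitive, one with little repeated structure.
rng = random.Random(0)
repetitive = "a b " * 2000
varied = " ".join(str(rng.randrange(10_000)) for _ in range(2000))

ratio_repetitive = gzip_compressibility(repetitive)
ratio_varied = gzip_compressibility(varied)

print(f"repetitive corpus: {ratio_repetitive:.3f}")
print(f"varied corpus:     {ratio_varied:.3f}")
```

Under this measure, the repetitive corpus yields a much lower ratio than the varied one. In the abstract's terms, a dataset with a higher ratio is harder to compress, which is the regime where the proposed scaling law favors dataset size over parameter count.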
