gzip Predicts Data-dependent Scaling Laws

Rohan Pandey

TL;DR

The paper challenges the assumption that neural language-model scaling laws are data-agnostic by showing that data complexity, proxied by gzip-compressibility, changes the trade-off between dataset size and parameter count. Using PCFG-generated datasets with controlled syntactic complexity, it shows that the fitted scaling-law parameters vary across datasets and that harder-to-compress data shifts the compute-optimal frontier toward more data rather than more parameters. It further proposes a data-dependent scaling law and shows that the effect persists when vocabulary size is held fixed, with implications for code versus natural language and potential practical uses in data curation and curriculum design. The work points to an information-theoretic direction for understanding and optimizing compute in language modeling, while noting the synthetic setting and the need for broader validation.

Abstract

Past work has established scaling laws that predict the performance of a neural language model (LM) as a function of its parameter count and the number of tokens it's trained on, enabling optimal allocation of a fixed compute budget. Are these scaling laws agnostic to training data, as some prior work suggests? We generate training datasets of varying complexities by modulating the syntactic properties of a PCFG, finding that 1) scaling laws are sensitive to differences in data complexity and that 2) gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties. We propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility; its compute-optimal frontier increases in dataset size preference (over parameter count preference) as training data becomes harder to compress.
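
To make the central quantity concrete, here is a minimal sketch of how a gzip-compressibility score could be computed for a token sequence. It assumes compressibility means compressed size divided by original size, so higher values are harder to compress, which matches the direction used in the figure captions. The helper names, the fixed-width token encoding, and the two toy sequences are illustrative assumptions, not the paper's code or data.

```python
import gzip
import random


def gzip_compressibility(data: bytes) -> float:
    """Compressed size divided by original size (assumed definition).

    Values near 0 mean highly compressible; values near 1 mean nearly
    incompressible.
    """
    if not data:
        raise ValueError("data must be non-empty")
    return len(gzip.compress(data)) / len(data)


def tokens_to_bytes(tokens, width=2):
    """Serialize token ids as fixed-width big-endian integers.

    The fixed-width encoding is an illustrative choice, not the paper's.
    """
    return b"".join(t.to_bytes(width, "big") for t in tokens)


rng = random.Random(0)

# Toy stand-ins for "simple" and "complex" data (hypothetical, not PCFG samples).
repetitive = [i % 4 for i in range(20_000)]            # short repeating pattern
noisy = [rng.randrange(5000) for _ in range(20_000)]   # near-uniform token ids

low = gzip_compressibility(tokens_to_bytes(repetitive))
high = gzip_compressibility(tokens_to_bytes(noisy))
print(f"repetitive: {low:.3f}  noisy: {high:.3f}")
```

The repeating sequence scores far lower than the near-uniform one, mirroring the ordering the paper reports between syntactically simple and syntactically complex PCFG data.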

Paper Structure (13 sections, 6 equations, 9 figures, 6 tables)


Figures (9)

  • Figure 1: Plot of different linguistic distributions and their gzip-compressibility, measured on both the raw string and the token sequence. The large grey and orange clusters are real-world code and natural language, respectively. The smaller clusters are sampled from PCFGs of increasingly complex syntax (with specific values for properties provided in the legend). Observe that 1) natural language is more difficult to compress than code, and 2) as our PCFGs' syntactic complexity increases, the token sequences become less compressible. PCFG data is more tightly clustered due to length normalization.
  • Figure 2: Training curves for models of various sizes trained on 100M token datasets with gzip-compressibilities of 0.11, 0.22, 0.35, 0.42, 0.51, and 0.61. The harder the data is to compress, the longer it takes to train to convergence.
  • Figure 3: The stationary grey dashed line is the compute-optimal frontier (Eq. \ref{eq:optimal_frontier}) of the allegedly data-agnostic Chinchilla scaling law, and the black line is the same frontier computed from our fitted scaling law for each PCFG dataset. Points are individual training runs colored by their loss. As the data becomes more difficult to compress, the scaling law shifts from being more parameter-preferent (a) than Chinchilla to far more data-preferent (d). Compare Appendix Fig. \ref{fig:scaling_contours} for parameter-data plot.
  • Figure 4: As data becomes harder to compress, the scaling law's fitted parameters trend downwards at different rates ($p < 0.05$), with $\alpha$ and $\beta$ intersecting at a gzip-compressibility of around 0.27, the point at which parameters and data should be scaled roughly equiproportionally.
  • Figure 5: When terminal count (vocab size) is held fixed and other syntactic properties are varied, scaling laws still shift with gzip-compressibility, and the correlation is even stronger.
  • ...and 4 more figures
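
The captions for Figures 3 and 4 refer to a compute-optimal frontier and to fitted exponents $\alpha$ and $\beta$. As a hedged sketch, the code below assumes the standard Chinchilla parametric form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ and the approximation $C \approx 6ND$ from Hoffmann et al. (2022); neither equation appears in the text above, and the numeric constants are placeholders chosen for illustration, not fitted values from the paper.

```python
def frontier_exponents(alpha: float, beta: float) -> tuple[float, float]:
    """Exponents a, b such that N_opt scales as C**a and D_opt as C**b.

    Derived by minimizing E + A/N**alpha + B/D**beta subject to C = 6*N*D
    (assumed Chinchilla form, not taken from the text above).
    """
    total = alpha + beta
    return beta / total, alpha / total


def compute_optimal(C: float, A: float, B: float, alpha: float, beta: float):
    """Compute-optimal parameter count N and token count D for budget C."""
    a, b = frontier_exponents(alpha, beta)
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N = G * (C / 6.0) ** a
    D = (C / 6.0) ** b / G
    return N, D


# When alpha == beta, parameters and data scale equiproportionally (a == b == 0.5).
a_eq, b_eq = frontier_exponents(0.3, 0.3)

# Placeholder constants for illustration only.
C = 1e20
N_opt, D_opt = compute_optimal(C, A=400.0, B=400.0, alpha=0.25, beta=0.35)
a_skew, b_skew = frontier_exponents(0.25, 0.35)
print(a_eq, b_eq, a_skew, b_skew)
print(f"N_opt={N_opt:.3e}  D_opt={D_opt:.3e}")
```

Under these assumptions, equal exponents give equiproportional scaling, which is what the crossing point in Figure 4 describes, and any gap between $\alpha$ and $\beta$ tilts the frontier toward parameters or toward data.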