Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression

Tingkai Yan, Haodong Wen, Binghui Li, Kairong Luo, Wenguang Chen, Kaifeng Lyu

TL;DR

This paper theoretically analyzes how a common workaround, training for multiple epochs on the same dataset, reshapes data scaling laws in linear regression. It characterizes the scaling behavior of the effective reuse rate $E(K, N)$ for SGD under either strong convexity or Zipf-distributed data.

Abstract

While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the effective reuse rate of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) As $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($\Theta(\log N)$ for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings point out a neglected factor in a recent empirical study (Muennighoff et al., 2023), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, i.e., $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum $K$ value for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
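To make the definition of the effective reuse rate concrete, the following is a minimal sketch of how $E(K, N)$ could be estimated from measured loss curves. It is not the authors' procedure: the function name `effective_reuse_rate`, the log-log interpolation, and the synthetic one-pass curve $L(N) = N^{-1/2}$ are all assumptions made for illustration. The idea is to invert the one-pass loss curve to find the dataset size $N'$ whose one-pass loss equals the $K$-epoch loss on $N$ samples, and then report $N'/N$.

```python
import numpy as np

def effective_reuse_rate(one_pass_sizes, one_pass_losses, multi_epoch_loss, n):
    """Estimate E(K, N) = N' / N, where N' is the one-pass dataset size whose
    test loss equals the K-epoch loss measured on n samples.

    Hypothetical helper: the inversion uses linear interpolation in
    log-log space, which is an illustrative choice.
    """
    sizes = np.asarray(one_pass_sizes, dtype=float)
    losses = np.asarray(one_pass_losses, dtype=float)
    # Assumption: sizes are increasing and the one-pass loss strictly decreases.
    if np.any(np.diff(sizes) <= 0) or np.any(np.diff(losses) >= 0):
        raise ValueError("need increasing sizes and strictly decreasing losses")
    if not (losses[-1] <= multi_epoch_loss <= losses[0]):
        raise ValueError("target loss lies outside the measured one-pass curve")
    # np.interp requires increasing x-coordinates, so use -log(loss) as x.
    log_n_equiv = np.interp(-np.log(multi_epoch_loss), -np.log(losses), np.log(sizes))
    return float(np.exp(log_n_equiv) / n)

# Synthetic one-pass curve L(N) = N^(-1/2); illustrative data, not from the paper.
sizes = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
losses = sizes ** -0.5

# Suppose K-epoch training on n = 200 samples reaches the one-pass loss of 800 samples.
e_hat = effective_reuse_rate(sizes, losses, multi_epoch_loss=800.0 ** -0.5, n=200)
print(f"estimated effective reuse rate: {e_hat:.3f}")
```

In this synthetic example the estimate is 4, meaning the repeated dataset is worth four times its size in fresh one-pass data. In the paper's notation, $E(K, N) \approx K$ would correspond to this estimate matching the number of epochs.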

Paper Structure

This paper contains 99 sections, 39 theorems, 253 equations, 3 figures, 2 tables.

Key Result

Theorem 4.1

Under Assumptions \ref{asp:strongly-convex}, \ref{asp:strongly-convex-parameter-prior}, and \ref{asp:large-dataset}, for multi-epoch SGD with $K$ epochs and dataset size $N$, it holds that

Figures (3)

  • Figure 1: Simulation experiments for strongly-convex linear regression and the solvable case with Zipf-distributed data and power spectrum. Results show that $E(K,N)$ is approximately proportional to some function of $N$ when $N$ is relatively small, and $E(K,N)\approx K$ when $N$ is relatively large. For the solvable case with Zipf-distributed data and power spectrum, we also fit the effective reuse rate using the formula $E(K,N)=c_1N^{c_2}$ suggested by \ref{thm:one-hot-E_K-v2}, and the fitted exponent $c_2=0.279\approx\frac{b}{a-b}=\frac{2}{7}$ matches our theory.
  • Figure 2: The effective reuse rate $E(K,N)$ over $K$ and training curves in language model experiments. \ref{fig:ek-k} shows that $E(K,N)\approx K$ when $K$ is small, specifically when $K\leq 4$. \ref{fig:loss-step} plots the points where $E(K,N)=0.8K$ under different configurations; we observe that $E(K,N)$ increases as $N$ increases, indicating that larger datasets can be repeated more.
  • Figure 3: The solvable cases with logarithmic power-law spectrum. $E(K,N)$ exhibits a similar behavior to that presented in \ref{fig:convex-and-power}. We also fit the effective reuse rate using the formula $E(K,N)=c_1\left(\log N\right)^{c_2}$ suggested by \ref{thm:one-hot-E_K-v2}, and the fitted exponent $c_2=2\approx b=2$ matches our theory.
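The fits mentioned in the captions of Figures 1 and 3 can be sketched as an ordinary least-squares regression in log space. This is an assumption about how such a fit might be done, not a description of the authors' fitting code; the helper name `fit_power_law`, the prefactors, and the noise-free synthetic data below are hypothetical. The exponents $2/7$ and $2$ used to generate the data are simply the theoretical values quoted in the captions.

```python
import numpy as np

def fit_power_law(x_values, e_values):
    """Fit E = c1 * x**c2 by least squares on (log x, log E).

    Hypothetical helper; returns the pair (c1, c2).
    """
    log_x = np.log(np.asarray(x_values, dtype=float))
    log_e = np.log(np.asarray(e_values, dtype=float))
    # np.polyfit returns coefficients from highest degree down: slope, intercept.
    c2, log_c1 = np.polyfit(log_x, log_e, deg=1)
    return float(np.exp(log_c1)), float(c2)

# Synthetic, noise-free data (illustrative only, not the paper's measurements).
n_values = np.array([1e3, 1e4, 1e5, 1e6])

# Power-law form E = c1 * N**c2, generated with the caption's exponent 2/7.
e_power = 1.5 * n_values ** (2.0 / 7.0)
c1_power, c2_power = fit_power_law(n_values, e_power)

# Logarithmic form E = c1 * (log N)**c2: the same fit with x = log N.
e_log = 0.5 * np.log(n_values) ** 2.0
c1_log, c2_log = fit_power_law(np.log(n_values), e_log)

print(f"power-law fit: c1={c1_power:.3f}, c2={c2_power:.3f}")
print(f"log-power fit: c1={c1_log:.3f}, c2={c2_log:.3f}")
```

Because both forms are linear after taking logarithms, a single helper covers both captions. With real, noisy measurements the recovered exponent would only approximate the theoretical value, as with the fitted $c_2 = 0.279$ reported for Figure 1.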

Theorems & Definitions (70)

  • Definition 3.1: Effective Reuse Rate
  • Theorem 4.1: Multi-Epoch Data Scaling Law
  • Theorem 4.2
  • Lemma 4.1: Corollary of Theorem 7.1 in huang2022matrix
  • Lemma 4.2: Small $K$
  • Lemma 4.3: Large $K$
  • Lemma 4.4: Approximately Optimal Learning Rate
  • Theorem 5.1
  • Theorem 5.2: Multi-Epoch Scaling Under Power-Law Spectrum
  • Theorem 5.3: Multi-Epoch Scaling Under Logarithmic Power-Law Spectrum
  • ...and 60 more