Scaling Law for Stochastic Gradient Descent in Quadratically Parameterized Linear Regression

Shihong Ding, Haihan Zhang, Hanzhen Zhao, Cong Fang

TL;DR

The paper extends neural scaling law insights to a quadratically parameterized regression under power-law spectra, showing SGD exhibits a two-phase learning dynamic with phase I adaptation and phase II estimation. By leveraging anisotropic Gaussian data and ground-truth decay, it derives an explicit excess-risk bound depending on the effective dimension $D$ and sample size $T$, and reveals problem-adaptive rates: $\tilde{O}(T^{-1+1/\beta})$ when $\alpha\le\beta$ and $\tilde{O}(T^{-(2\beta-2)/(\,\alpha+\beta\,)})$ when $\alpha>\beta$. The analysis demonstrates feature learning in the quadratic model can outperform linear baselines in certain spectral regimes and, in a complementary regime, attain information-theoretic optimality. The work provides a rigorous two-phase SGD framework with coupling arguments and phase-wise decompositions, offering insight into implicit regularization and scalable generalization for models with learned features.
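The two rate regimes quoted above can be tabulated with a small helper. This is an illustrative sketch only: the function name `rate_exponent` is hypothetical, the logarithmic factors hidden by $\tilde{O}$ are dropped, and it assumes $\beta>1$ so that both exponents are negative.

```python
def rate_exponent(alpha: float, beta: float) -> float:
    """Exponent e such that the quoted excess-risk rate scales like T**e.

    Log factors hidden by the O-tilde notation are ignored (assumption).
    """
    if alpha <= beta:
        # Regime alpha <= beta: rate T^(-1 + 1/beta)
        return -1.0 + 1.0 / beta
    # Regime alpha > beta: rate T^(-(2*beta - 2) / (alpha + beta))
    return -(2.0 * beta - 2.0) / (alpha + beta)


if __name__ == "__main__":
    # The (alpha, beta) pairs used in panels (a) and (b) of Figure 1.
    for alpha, beta in [(2.5, 1.5), (3.0, 2.0)]:
        print(alpha, beta, rate_exponent(alpha, beta))
```

The two branches agree at $\alpha=\beta$, where both reduce to $-1+1/\beta$, so the exponent is continuous across the regime boundary. For the Figure 1 settings the exponents are $-0.25$ for $(\alpha,\beta)=(2.5,1.5)$ and $-0.4$ for $(\alpha,\beta)=(3,2)$.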

Abstract

In machine learning, a scaling law describes how model performance improves as the model and data size scale up. From a learning-theory perspective, this class of results establishes upper and lower generalization bounds for a specific learning algorithm. Here, the exact algorithm, run with a specific model parameterization, often provides a crucial implicit regularization effect that leads to good generalization. To characterize scaling laws, previous theoretical studies have mainly focused on linear models, whereas feature learning, a notable process that contributes to the remarkable empirical success of neural networks, has regrettably been left unaddressed. This paper studies the scaling law for linear regression with a quadratically parameterized model. We consider infinite-dimensional data and a ground-truth slope, both exhibiting certain power-law decay rates. We study the convergence rates of Stochastic Gradient Descent (SGD) and show that the learning rates of the variables automatically adapt to the ground truth. As a result, for canonical linear regression, we provide explicit separations between the generalization curves of SGD with and without feature learning, and the information-theoretic lower bound that is agnostic to the parameterization method and the algorithm. Our analysis for decaying ground truth provides a new characterization of the learning dynamics of the model.

Paper Structure

This paper contains 24 sections, 36 theorems, 186 equations, 2 figures, 1 algorithm.

Key Result

Theorem 4.1

Under Assumptions ass-d and ass-ss, consider a predictor trained by Algorithm SGD with total sample size $T$ and middle-phase length $h=\lceil T/\log(T)\rceil$. Let $D\asymp\min\{T^{1/\max\{\beta,(\alpha+\beta)/2\}},M\}$ and $\eta\asymp D^{\min\{0,(\alpha-\beta)/4\}}$. Then, with probability at least 0.95, the error of the output admits an explicit bound depending on $D$ and $T$.
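The parameter choices in the theorem statement can be made concrete with a short sketch. Several things here are assumptions rather than part of the theorem: the function name `theorem_params` is hypothetical, the constants hidden by $\asymp$ are set to 1, and $\log$ is taken to be the natural logarithm.

```python
import math


def theorem_params(T: int, alpha: float, beta: float, M: float):
    """Return (h, D, eta) from the theorem's formulas.

    Assumptions: all constants hidden by the asymptotic notation are 1,
    and log is the natural logarithm.
    """
    if T < 2:
        raise ValueError("T must be at least 2 so that log(T) > 0")
    # Middle-phase length h = ceil(T / log T)
    h = math.ceil(T / math.log(T))
    # Effective dimension D = min{T^(1 / max{beta, (alpha + beta) / 2}), M}
    D = min(T ** (1.0 / max(beta, (alpha + beta) / 2.0)), M)
    # Step size eta = D^(min{0, (alpha - beta) / 4})
    eta = D ** min(0.0, (alpha - beta) / 4.0)
    return h, D, eta


if __name__ == "__main__":
    print(theorem_params(T=10_000, alpha=3.0, beta=2.0, M=1e6))
    print(theorem_params(T=10_000, alpha=1.5, beta=2.5, M=1e6))
```

Under these assumptions, the step size stays at the constant scale when $\alpha\ge\beta$, because the exponent $\min\{0,(\alpha-\beta)/4\}$ is zero, and shrinks with $D$ when $\alpha<\beta$.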

Figures (2)

  • Figure 1: Empirical results on the convergence rate of the quadratic model with spectral decay vs. the traditional linear model. (a) and (b) show the mean error against the number of iteration steps, with $\alpha = 2.5, \beta = 1.5$ in (a) and $\alpha = 3, \beta = 2$ in (b). (c) shows the log-scale curve of the final mean loss against the sample size, where the solid lines represent the empirical results and the dashed lines represent the theoretical rates.
  • Figure 2: Numerical simulation results.
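
A comparison like the one described for panel (c) of Figure 1 is usually made by fitting a straight line to log-loss against log-sample-size and reading off the slope. The sketch below shows one way to do that with ordinary least squares. The helper `loglog_slope` is hypothetical, and the data in the usage example is synthetic and noise-free, generated from an exact power law purely to exercise the function; it is not taken from the paper's experiments.

```python
import math


def loglog_slope(sample_sizes, losses):
    """Least-squares slope of log(loss) against log(sample size)."""
    xs = [math.log(t) for t in sample_sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic placeholder data following loss = 5 * T^(-0.4) exactly.
    sizes = [10 ** k for k in range(2, 7)]
    losses = [5.0 * t ** -0.4 for t in sizes]
    print(loglog_slope(sizes, losses))
```

On real measurements, the fitted slope would be compared with the theoretical exponent for the chosen $\alpha$ and $\beta$; logarithmic factors in the rate can make the empirical slope deviate slightly at small sample sizes.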

Theorems & Definitions (67)

  • Remark 3.2
  • Remark 3.4
  • Theorem 4.1
  • Corollary 4.2
  • Corollary 4.3
  • Remark 4.4
  • Remark 4.5
  • Theorem 5.1
  • Lemma 5.2
  • Lemma 5.3
  • ...and 57 more