Least Squares Regression Can Exhibit Under-Parameterized Double Descent

Xinyue Li, Rishi Sonthalia

TL;DR

The authors postulate that the location of the double descent peak depends on technical properties of both the spectrum and the eigenvectors of the sample covariance. They present two simple examples that provably exhibit double descent in the under-parameterized regime, where prior work found that the standard bias-variance trade-off holds.

Abstract

The relationship between the number of training data points, the number of parameters, and the generalization capabilities of models has been widely studied. Previous work has shown that double descent can occur in the over-parameterized regime and that the standard bias-variance trade-off holds in the under-parameterized regime. These works provide multiple reasons for the existence of the peak. We postulate that the location of the peak depends on the technical properties of both the spectrum as well as the eigenvectors of the sample covariance. We present two simple examples that provably exhibit double descent in the under-parameterized regime and do not seem to occur for reasons provided in prior work.

Paper Structure

This paper contains 40 sections, 35 theorems, 208 equations, 17 figures, and 1 table.

Key Result

Theorem 1

Suppose the training data $X_{trn}$ and test data $X_{tst}$ satisfy Assumption data:1 and the noise $A_{trn}, A_{tst}$ satisfy Assumption noise:1. Let $\mu$ be the regularization parameter. Then for the under-parameterized regime (i.e., $c < 1$) for the solution $W_{opt}$ to Problem prob:denoise, …
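
The quantities named in the theorem ($X_{trn}$, $A_{trn}$, $\mu$, $W_{opt}$, and the ratio $c$) can be explored numerically. The following is a minimal sketch under assumptions that are not stated in this excerpt: Problem prob:denoise is taken to be the ridge-regularized denoising objective $\|X_{trn} - W(X_{trn} + A_{trn})\|_F^2 + \mu^2\|W\|_F^2$, the data are rank one with singular values $\sqrt{n}$ and $\sqrt{n_{tst}}$ (matching the values quoted in the Figure 2 caption), the noise has i.i.d. Gaussian entries with variance $1/d$, and $c = d/n$. The function names `ridge_denoiser` and `empirical_risk` are hypothetical and are not taken from the paper.

```python
import numpy as np


def ridge_denoiser(X, A, mu):
    """Closed-form minimizer of ||X - W (X + A)||_F^2 + mu^2 ||W||_F^2.

    ASSUMPTION: this objective stands in for the paper's denoising problem,
    which is not spelled out in the excerpt.
    """
    Y = X + A
    d = Y.shape[0]
    # Setting the gradient to zero gives W (Y Y^T + mu^2 I) = X Y^T.
    # The system matrix is symmetric, so solve for W^T and transpose back.
    G = Y @ Y.T + mu**2 * np.eye(d)
    return np.linalg.solve(G, Y @ X.T).T


def empirical_risk(d, n, n_tst, mu, trials, rng):
    """Average per-sample test error over independent trials.

    ASSUMPTIONS: rank-one data sharing one direction u, noise entries
    drawn from N(0, 1/d).
    """
    total = 0.0
    for _ in range(trials):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        v_trn = rng.standard_normal(n)
        v_trn /= np.linalg.norm(v_trn)
        v_tst = rng.standard_normal(n_tst)
        v_tst /= np.linalg.norm(v_tst)

        X_trn = np.sqrt(n) * np.outer(u, v_trn)          # sigma_trn = sqrt(n)
        X_tst = np.sqrt(n_tst) * np.outer(u, v_tst)      # sigma_tst = sqrt(n_tst)
        A_trn = rng.standard_normal((d, n)) / np.sqrt(d)
        A_tst = rng.standard_normal((d, n_tst)) / np.sqrt(d)

        W = ridge_denoiser(X_trn, A_trn, mu)
        residual = X_tst - W @ (X_tst + A_tst)
        total += np.linalg.norm(residual, "fro") ** 2 / n_tst
    return total / trials


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 50  # kept small so the sweep runs quickly
    for n in (60, 100, 200, 400):  # c = d / n < 1 throughout
        risk = empirical_risk(d, n, n_tst=100, mu=1.0, trials=5, rng=rng)
        print(f"c = {d / n:.3f}  empirical risk = {risk:.4f}")
```

Sweeping `n` at fixed `d` corresponds to what the figure captions call the data scaling regime, and sweeping `d` at fixed `n` to the parameter scaling regime. Whether a peak appears at these small sizes and under these assumed distributions is not guaranteed by anything in this excerpt.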

Figures (17)

  • Figure 1: Bias-variance trade-off and double descent.
  • Figure 2: Theoretical risk curve from Theorem \ref{thm:result} and empirical values in the data scaling regime for different values of $\mu$ [(L) $\mu = 0.1$, (C) $\mu = 1$, (R) $\mu=2$]. Here $\sigma_{trn}=\sqrt{n}, \sigma_{tst} = \sqrt{n_{tst}}, d=1000, n_{tst} = 1000$. For each empirical point, we ran at least 100 trials. More details can be found in Appendix \ref{app:numerical}.
  • Figure 3: Risk for the ablation experiment. Left: Empirical Expected Risk when using $\tilde{A}$ for the noise. Right: Empirical risk when we replace $\hat{V}$ with a random orthogonal matrix.
  • Figure 4: Theoretical risk curve from Theorem \ref{thm:result} and empirical values in the parameter scaling regime for different values of $\mu$ [(L) $\mu=0.1$, (C) $\mu=0.2$, (R) $\mu=0.5$]. Here, only $\mu=0.1$ has a local peak. Here $n = n_{tst} = 1000$ and $\sigma_{trn} = \sigma_{tst} = \sqrt{1000}$. Each empirical point is an average of 100 trials.
  • Figure 5: Generalization error versus $\mathbb{E}\left[\|W_{opt}\|_F^2\right]$ for the parameter scaling regime for three different values of $\mu$.
  • ...and 12 more figures

Theorems & Definitions (65)

  • Definition 1
  • Definition 2
  • Theorem 1: Generalization Error Formula
  • Proof (Sketch)
  • Theorem 2: Under-Parameterized Peak
  • Theorem 3: $\|W_{opt}\|_F$ Peak
  • Theorem 4: Under-parametrized Peak
  • Theorem 5: Training Error
  • Proposition 1: Optimal $\sigma_{trn}$
  • Theorem 5: Under-parametrized Peak
  • ...and 55 more