Least Squares Regression Can Exhibit Under-Parameterized Double Descent

Xinyue Li; Rishi Sonthalia

TL;DR

The authors postulate that the location of the double descent peak depends on technical properties of both the spectrum and the eigenvectors of the sample covariance, whereas prior work held that the standard bias-variance trade-off holds in the under-parameterized regime.

Abstract

The relationship between the number of training data points, the number of parameters, and the generalization capabilities of models has been widely studied. Previous work has shown that double descent can occur in the over-parameterized regime and that the standard bias-variance trade-off holds in the under-parameterized regime. These works provide multiple reasons for the existence of the peak. We postulate that the location of the peak depends on technical properties of both the spectrum and the eigenvectors of the sample covariance. We present two simple examples that provably exhibit double descent in the under-parameterized regime and that do not seem to arise for the reasons given in prior work.
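The quantities named in the abstract (a regularized least squares estimator, the spectrum and eigenvectors of the sample covariance, and the risk as the number of training points varies) can be explored numerically with a minimal sketch such as the one below. It is not the paper's setup: the isotropic Gaussian data, the linear teacher `w_true`, the label noise level `noise_std`, the penalty written as `mu**2`, and the function names are all assumptions made for illustration, and this generic model is not claimed to reproduce the under-parameterized peak.

```python
import numpy as np

def ridge_fit(X, y, mu):
    """Solve min_w ||X w - y||^2 + mu^2 ||w||^2 via the normal equations.
    (The mu^2 form of the penalty is an assumption of this sketch.)"""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + mu**2 * np.eye(d), X.T @ y)

def sample_covariance_eig(X):
    """Eigenvalues (ascending) and eigenvectors of the sample covariance X^T X / n."""
    n = X.shape[0]
    return np.linalg.eigh(X.T @ X / n)

def risk_curve(d, ns, mu, n_test=500, trials=20, noise_std=0.5, seed=0):
    """Average test risk of the ridge estimator for each training size n in ns.
    Data model (assumed, hypothetical): isotropic Gaussian inputs, linear teacher."""
    rng = np.random.default_rng(seed)
    risks = []
    for n in ns:
        errs = []
        for _ in range(trials):
            w_true = rng.standard_normal(d) / np.sqrt(d)
            X = rng.standard_normal((n, d))
            y = X @ w_true + noise_std * rng.standard_normal(n)
            w_hat = ridge_fit(X, y, mu)
            X_test = rng.standard_normal((n_test, d))
            # Excess risk: mean squared error against the noiseless target.
            errs.append(np.mean((X_test @ (w_hat - w_true)) ** 2))
        risks.append(float(np.mean(errs)))
    return np.array(risks)
```

Calling `risk_curve(d, ns, mu)` with every `n` in `ns` larger than `d` keeps the experiment in the regime with more data points than parameters; plotting the result against `d / n` gives a risk curve that can be compared with the shape of `sample_covariance_eig(X)[0]` for the same sizes.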

Paper Structure (40 sections, 35 theorems, 208 equations, 17 figures, 1 table)

Introduction
Contributions.
Organization.
Prior Work on Double Descent
Double Descent with Input Noise.
Spectral Properties of the Data Affect the Peak Location
Alignment Mismatch
Model Assumptions
Assumptions about $A$.
Expected Risk and Peak Location
The Peak Occurs Due to Alignment Mismatch
Ablation experiment
Connection to Prior Double Descent Theory
Peak in the Norm of the Estimator Does Not Imply a Peak in the Risk
Shifting Local Maximum for Stieltjes Transform as a Function of $c$
...and 25 more sections

Key Result

Theorem 1

Suppose the training data $X_{trn}$ and test data $X_{tst}$ satisfy Assumption data:1 and the noise $A_{trn}, A_{tst}$ satisfy Assumption noise:1. Let $\mu$ be the regularization parameter. Then for the under-parameterized regime (i.e., $c < 1$) for the solution $W_{opt}$ to Problem prob:denoise, th where

Figures (17)

Figure 1: Bias-variance trade-off and double descent.
Figure 2: Figure showing the theoretical risk curve from Theorem \ref{thm:result} and empirical values in the data scaling regime for different values of $\mu$ [(L) $\mu = 0.1$, (C) $\mu = 1$, (R) $\mu = 2$]. Here $\sigma_{trn} = \sqrt{n}$, $\sigma_{tst} = \sqrt{n_{tst}}$, $d = 1000$, $n_{tst} = 1000$. For each empirical point, we ran at least 100 trials. More details can be found in Appendix \ref{app:numerical}.
Figure 3: Risk for the ablation experiment. Left: Empirical Expected Risk when using $\tilde{A}$ for the noise. Right: Empirical risk when we replace $\hat{V}$ with a random orthogonal matrix.
Figure 4: Figure showing the theoretical risk curve from Theorem \ref{thm:result} and empirical values in the parameter scaling regime for different values of $\mu$ [(L) $\mu = 0.1$, (C) $\mu = 0.2$, (R) $\mu = 0.5$]. Here, only $\mu = 0.1$ has a local peak. Here $n = n_{tst} = 1000$ and $\sigma_{trn} = \sigma_{tst} = \sqrt{1000}$. Each empirical point is an average of 100 trials.
Figure 5: Figure showing generalization error versus $\mathbb{E}\left[\|W_{opt}\|_F^2\right]$ for the parameter scaling regime for three different values of $\mu$.
...and 12 more figures

Theorems & Definitions (65)

Definition 1
Definition 2
Theorem 1: Generalization Error Formula
Proof: Sketch
Theorem 2: Under-Parameterized Peak
Theorem 3: $\|W_{opt}\|_F$ Peak
Theorem 4: Under-Parameterized Peak
Theorem 5: Training Error
Proposition 1: Optimal $\sigma_{trn}$
Theorem 5: Under-Parameterized Peak
...and 55 more
