High-dimensional analysis of ridge regression for non-identically distributed data with a variance profile

Jérémie Bigot; Issa-Mbenard Dabo; Camille Male

TL;DR

This work extends high-dimensional ridge regression analysis to data with non-identically distributed predictors modeled by a variance profile $X_n = \Upsilon_n \circ Z_n$. Using random matrix theory and Dyson-type fixed-point equations, the authors derive deterministic equivalents for the ridge degrees of freedom and for the training and predictive risks, capturing how the ratio $p/n$ and the variance profile shape affect the risk. A key result is that the diagonal of the resolvent $Q_p(z)$ has a deterministic equivalent $T_p(z)$, enabling explicit risk formulas in terms of $T_p(-\lambda)$ and its derivative, and revealing that double descent persists under many variance profiles but can exhibit other shapes (e.g., triple or quadruple descent) for certain profiles. Numerical experiments with synthetic variance profiles and MNIST-based data validate the theory and illustrate the practical impact, offering a tool to analyze ridge regression in heteroscedastic, mixture-like settings and guiding extensions to other estimators and correlated data.

Abstract

High-dimensional linear regression has been thoroughly studied in the context of independent and identically distributed data. We propose to investigate high-dimensional regression models for independent but non-identically distributed data. To this end, we suppose that the set of observed predictors (or features) is a random matrix with a variance profile and with dimensions growing at a proportional rate. Assuming a random effect model, we study the predictive risk of the ridge estimator for linear regression with such a variance profile. In this setting, we provide deterministic equivalents of this risk and of the degrees of freedom of the ridge estimator. For certain classes of variance profiles, our work highlights the emergence of the well-known double descent phenomenon in high-dimensional regression for the minimum norm least-squares estimator when the ridge regularization parameter goes to zero. We also exhibit variance profiles for which the shape of this predictive risk differs from double descent. The proofs of our results are based on tools from random matrix theory in the presence of a variance profile that have not been considered so far to study regression models. Numerical experiments are provided to show the accuracy of the aforementioned deterministic equivalents for the computation of the predictive risk of ridge regression. We also investigate the similarities and differences with the standard setting of independent and identically distributed data.
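The setting described in the abstract can be simulated directly. The sketch below is illustrative only and rests on several assumptions not stated in this summary: Gaussian entries for $Z_n$, the normalization $\widehat{\Sigma}_n = X_n^\top X_n / n$, a random effect $\beta$ with covariance $(\alpha^2/p) I_p$, and a test error measured on a fresh design drawn from the same variance profile. The function names are hypothetical.

```python
import numpy as np

def sample_design(gamma, rng):
    """Draw X = gamma ∘ Z (Hadamard product) with i.i.d. N(0, 1) entries in Z.
    gamma is the (n, p) array of standard deviations (assumed Gaussian model)."""
    return gamma * rng.standard_normal(gamma.shape)

def ridge_estimator(X, y, lam):
    """Ridge solution (X^T X / n + lam I)^{-1} X^T y / n (assumed normalization)."""
    n, p = X.shape
    sigma_hat = X.T @ X / n
    return np.linalg.solve(sigma_hat + lam * np.eye(p), X.T @ y / n)

def degrees_of_freedom(X, lam):
    """Tr[Sigma_hat (Sigma_hat + lam I)^{-1}], computed from the eigenvalues."""
    n = X.shape[0]
    eig = np.linalg.eigvalsh(X.T @ X / n)
    eig = np.clip(eig, 0.0, None)  # guard against tiny negative round-off
    return float(np.sum(eig / (eig + lam)))

def monte_carlo_test_error(gamma, lam, alpha=1.0, sigma=1.0, n_rep=50, seed=0):
    """Average squared prediction error on a fresh design from the same profile.
    This is one possible proxy for a predictive risk, not the paper's definition."""
    rng = np.random.default_rng(seed)
    n, p = gamma.shape
    errors = []
    for _ in range(n_rep):
        beta = rng.standard_normal(p) * alpha / np.sqrt(p)  # random effect
        X = sample_design(gamma, rng)
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = ridge_estimator(X, y, lam)
        X_new = sample_design(gamma, rng)
        errors.append(np.mean((X_new @ (beta_hat - beta)) ** 2))
    return float(np.mean(errors))
```

Sweeping `p` for a fixed `n` with a small `lam` and plotting `monte_carlo_test_error` is one way to look for double descent or other shapes under a chosen profile.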

Paper Structure (18 sections, 11 theorems, 78 equations, 5 figures)


Introduction
Main contributions
Organisation of the paper
Publicly available source code
Related works
High-dimensional linear regression from the random matrix perspective
Linear regression for independent but non-identically distributed data
The use of variance profile in RMT
Main results
Degrees of freedom
Deterministic equivalents of the diagonal of the resolvent
Deterministic equivalents of the training and predictive risks
Numerical experiments
Conclusion
Proofs of Proposition \ref{prop:DOF} and Lemmas \ref{lem:decomp} and \ref{lem:decomp}
...and 3 more sections

Key Result

Theorem 2.1

Under Assumptions hyp:Z-hyp:dim_rec, the following limit holds true almost surely [...], where $T_p(z)$ and $\widetilde{T}_n(z)$ are diagonal matrices of size $p \times p$ and $n \times n$ respectively, whose diagonal elements are the unique solutions of the deterministic system of $p + n$ equations [...]. Moreover, $\frac{1}{p} \mathop{\mathrm{Tr}}\nolimits [T_p(z)]$ and $\frac{1}{n} \mathop{\mathrm{Tr}}\nolimits [\widetilde{T}_n(z)]$ [...]
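A system of $p + n$ equations of this kind can be solved numerically by fixed-point iteration. The sketch below is a hedged illustration, not the paper's statement: it assumes the normalization $\widehat{\Sigma}_n = X_n^\top X_n / n$, the resolvent $Q_p(z) = (\widehat{\Sigma}_n - z I_p)^{-1}$ evaluated at $z = -\lambda$ with $\lambda > 0$, and the standard form of the equations for Gram matrices with a variance profile, namely $T_j = 1 / \big(\lambda (1 + \tfrac{1}{n} \sum_i \gamma_{ij}^2 \widetilde{T}_i)\big)$ and $\widetilde{T}_i = 1 / \big(\lambda (1 + \tfrac{1}{n} \sum_j \gamma_{ij}^2 T_j)\big)$. The exact system in the paper may differ in normalization, and all function names are hypothetical.

```python
import numpy as np

def solve_profile_system(gamma2, lam, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration for the assumed system at z = -lam, lam > 0.
    gamma2 is the (n, p) array of variances gamma_ij^2.
    Returns the diagonals of T_p(-lam) (length p) and T_tilde_n(-lam) (length n)."""
    n, p = gamma2.shape
    T = np.zeros(p)  # starting from 0, the iterates increase monotonically
    for _ in range(max_iter):
        T_tilde = 1.0 / (lam * (1.0 + gamma2 @ T / n))
        T_new = 1.0 / (lam * (1.0 + gamma2.T @ T_tilde / n))
        converged = np.max(np.abs(T_new - T)) < tol
        T = T_new
        if converged:
            break
    T_tilde = 1.0 / (lam * (1.0 + gamma2 @ T / n))
    return T, T_tilde

def empirical_resolvent_diag(gamma2, lam, rng):
    """Diagonal of (X^T X / n + lam I)^{-1} for one Gaussian draw of X."""
    n, p = gamma2.shape
    X = np.sqrt(gamma2) * rng.standard_normal((n, p))
    Q = np.linalg.inv(X.T @ X / n + lam * np.eye(p))
    return np.diag(Q)

def marchenko_pastur_stieltjes(c, lam):
    """Stieltjes transform of the Marchenko-Pastur law (ratio c = p/n) at -lam.
    Used as a sanity check: with a constant unit profile the system reduces to it."""
    b = 1.0 + lam - c
    return (-b + np.sqrt(b * b + 4.0 * c * lam)) / (2.0 * c * lam)
```

For a constant unit profile, every entry of the returned `T` should agree with `marchenko_pastur_stieltjes(p / n, lam)`. For a non-constant profile, the average of `empirical_resolvent_diag` can be compared with the average of `T` as a rough check of the deterministic equivalent.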

Figures (5)

Figure 1: Predictive risk for several variance profiles with $\lambda = 0$. (a) Comparison of constant and quasi doubly stochastic variance profiles with $\alpha = \sigma = 1$, $n = 100$ and $p$ varying from $10$ to $200$. (b) Piecewise constant variance profile with $\gamma_1^2 = 0.0005$, $\gamma_2^2 = 1$, $\alpha = \sigma = 1$, $n = 100$ and $p$ varying from $10$ to $600$.
Figure 2: The Berlin Photo variance profile, whose entries correspond to the green channel of the pixels of an RGB picture taken in Berlin.
Figure 3: Training and predictive risk for several variance profiles with $\lambda$ ranging from $0.1$ to $10$, $\alpha = 1$, $\sigma = 1$, $n = 400$ and $p = 600$. (a) Comparison of $\hat{r}_\lambda^{test}(X_n)$ and $r_\lambda^{test}(X_n)$ for several variance profiles. (b) Comparison of $\hat{r}_\lambda^{train}(X_n)$ and $r_\lambda^{train}(X_n)$ for several variance profiles. The dashed lines correspond to the risks while the solid lines correspond to the deterministic equivalents.
Figure 4: Left: Double descent phenomenon for several variance profiles with $\alpha = \sigma = 1$, $n = 100$ and $p$ varying from $10$ to $600$. Right: Smallest non-zero eigenvalue of $\widehat{\Sigma}_n$ for several variance profiles with $\alpha = \sigma = 1$, $n = 300$ and $p$ varying from $30$ to $1800$.
Figure 5: Study of the MNIST dataset with $\lambda = 0$. (a) Comparison of real and synthetic data with $\alpha = \sigma = 1$, $p = 784$ and $n$ varying from $78$ to $1568$. The solid lines correspond to the predictive risk while the dashed line corresponds to the deterministic equivalent. (b) Heatmap of the variance of pixels for each class. These heatmaps serve as variance profiles in the mixture model described below.

Theorems & Definitions (23)

Definition 2.1
Theorem 2.1
Proposition 3.1
Definition 3.1
Theorem 3.1
Lemma 3.1
Theorem 3.2
Lemma 3.2
Corollary 3.1
Corollary 3.2
...and 13 more
