Implicit Regularization Paths of Weighted Neural Representations

Jin-Hong Du; Pratik Patil

TL;DR

Ridge estimators trained on weighted features along the same equivalence path are shown to be asymptotically equivalent when evaluated against test vectors of bounded norm, and this equivalence yields an efficient cross-validation method for tuning.

Abstract

We study the implicit regularization effects induced by (observation) weighting of pretrained features. For weight and feature matrices of bounded operator norms that are infinitesimally free with respect to (normalized) trace functionals, we derive equivalence paths connecting different weighting matrices and ridge regularization levels. Specifically, we show that ridge estimators trained on weighted features along the same path are asymptotically equivalent when evaluated against test vectors of bounded norms. These paths can be interpreted as matching the effective degrees of freedom of ridge estimators fitted with weighted features. For the special case of subsampling without replacement, our results apply to independently sampled random features and kernel features and confirm recent conjectures (Conjectures 7 and 8) of the authors on the existence of such paths in Patil et al. We also present an additive risk decomposition for ensembles of weighted estimators and show that the risks are equivalent along the paths when the ensemble size goes to infinity. As a practical consequence of the path equivalences, we develop an efficient cross-validation method for tuning and apply it to subsampled pretrained representations across several models (e.g., ResNet-50) and datasets (e.g., CIFAR-100).
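The degrees-of-freedom matching described in the abstract can be illustrated numerically. The sketch below is a minimal example under stated assumptions, not the paper's procedure: it assumes the degrees of freedom are defined as $\mathrm{tr}[\bm{G}(\bm{G}+\lambda \bm{I})^{-1}]$ with $\bm{G}$ the feature covariance normalized by the full sample size $n$, it uses isotropic Gaussian features as a stand-in for real data, and the function names (`df_from_eigs`, `match_full_data_penalty`) are hypothetical. Given a subsample and a ridge level $\lambda$, it searches by bisection for the full-data ridge level $\mu$ with the same empirical degrees of freedom.

```python
import numpy as np


def df_from_eigs(eigs, lam):
    """Degrees of freedom sum_i e_i / (e_i + lam) for covariance eigenvalues e_i."""
    return float(np.sum(eigs / (eigs + lam)))


def cov_eigs(X, n_norm):
    """Eigenvalues of X^T X / n_norm, computed from the singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return s**2 / n_norm


def match_full_data_penalty(X, idx, lam, lo=1e-8, hi=1e4, iters=200):
    """Find mu such that full-data df at mu equals subsample df at lam.

    Assumption: both covariances are normalized by the full sample size n,
    which corresponds to a 0/1 diagonal weight matrix selecting rows `idx`.
    """
    n = X.shape[0]
    eigs_full = cov_eigs(X, n)
    eigs_sub = cov_eigs(X[idx], n)
    target = df_from_eigs(eigs_sub, lam)
    # df is decreasing in the ridge level, so bisect on a log scale.
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if df_from_eigs(eigs_full, mid) > target:
            lo = mid  # too many degrees of freedom: increase the penalty
        else:
            hi = mid
    mu = float(np.sqrt(lo * hi))
    return mu, target, df_from_eigs(eigs_full, mu)


rng = np.random.default_rng(0)
n, p, k, lam = 400, 50, 200, 0.1
X = rng.standard_normal((n, p))  # illustrative features only
idx = rng.choice(n, size=k, replace=False)  # subsampling without replacement
mu, df_sub, df_full = match_full_data_penalty(X, idx, lam)
print(f"lambda={lam}, matched mu={mu:.4f}, df_sub={df_sub:.4f}, df_full={df_full:.4f}")
```

In this setup the matched $\mu$ is larger than $\lambda$: discarding observations removes positive-semidefinite mass from the covariance, so the subsample acts like additional ridge regularization on the full data.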

Paper Structure (36 sections, 12 theorems, 85 equations, 9 figures, 2 tables, 1 algorithm)

Introduction
Summary of results and paper outline
Related literature
Preliminaries
Implicit regularization paths
Examples of weight matrices
Examples of feature matrices
Prediction risk asymptotics and risk estimation
Optimal oracle tuning
Data-dependent tuning
Validation on real-world datasets
Limitations and outlook
Technical background
Basics of free probability theory
Useful transforms and their relationships
...and 21 more sections

Key Result

Theorem 1

For $\bm{G}_{\bm{I}}\in\mathbb{R}^{n\times n}$, suppose that the subsampling operator $\bm{W}\in\mathbb{R}^{n\times n}$ satisfies cond:sketch-observation and $\limsup \| \bm{y} \|_2^2 / n < \infty$ as $n \to \infty$. For any $\mu > - \liminf_{n\to\infty} \lambda_{\min}^+(\bm{G}_{\bm{I}})$, let $\lambda$ [...] where $\mathcal{S}_{\bm{W}^\top \bm{W}}$ is the $S$-transform of the operator $\bm{W}^\top \bm{W}$ [...]

Figures (9)

Figure 1: Equivalence under subsampling. The left panel shows the heatmap of degrees of freedom, and the right panel shows the random projection $\mathbb{E}_{\bm{W}}[\bm{a}^{\top}\widehat{\bm{\beta}}_{\bm{W},\lambda}]$ where $\bm{a}\sim\mathcal{N}({\bm 0}_p,\bm{I}_p/p)$. In both heatmaps, the red lines indicate the predicted paths using \ref{['eq:subsample-wor-path']}, and the black dashed lines indicate the empirical paths obtained by matching empirical degrees of freedom. The data is generated according to \ref{['subsec:simu']} with $n=10000$ and $p=1000$, and the results are averaged over $M=100$ random weight matrices $\bm{W}$.
Figure 2: Equivalence of degrees of freedom for various feature structures under subsampling. The three panels correspond to linear features, random features with ReLU activation function (2-layer), and kernel features (polynomial kernel with degree 3 and without intercept), respectively. In all heatmaps, the red lines indicate the predicted paths using \ref{['eq:subsample-wor-path']}, and the black dashed lines indicate the empirical paths obtained by matching the empirical degrees of freedom. The data is generated according to \ref{['subsec:simu']} with $n=5000$ and $p=500$, and the results are averaged over $M=100$ random weight matrices $\bm{W}$.
Figure 3: Equivalence in pretrained features of ResNet-50 on the Flowers-102 dataset.
Figure 4: Risk estimation by corrected and extrapolated generalized cross-validation. The risk estimates are computed based on $M_0=25$ base estimators using \ref{['alg:cross-validation']} with $\lambda=10^{-3}$.
Figure 5: Equivalence under bootstrapping. The left panel shows the heatmap of degrees of freedom, and the right panel shows the random projection $\mathbb{E}_{\bm{W}}[\bm{a}^{\top}\widehat{\bm{\beta}}_{\bm{W},\lambda}]$ where $\bm{a}\sim\mathcal{N}({\bm 0}_p,\bm{I}_p/p)$. In both heatmaps, the red lines indicate the predicted paths using \ref{['eq:subsample-wor-path']}, and the black dashed lines indicate the empirical paths obtained by matching empirical degrees of freedom. Despite the complexity of the theoretical path for bootstrapping, we observe that the empirical paths closely resemble it. Therefore, the theoretical path for sampling without replacement from \ref{['eq:subsample-wor-path']} serves as a good approximation.
...and 4 more figures
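
The random projection $\mathbb{E}_{\bm{W}}[\bm{a}^{\top}\widehat{\bm{\beta}}_{\bm{W},\lambda}]$ plotted in Figures 1 and 5 can be approximated by Monte Carlo averaging over random subsamples. The following is a rough sketch under assumptions that are not taken from the paper: the data are isotropic Gaussian with a linear response (the actual simulation setup in \ref{['subsec:simu']} is not reproduced here), the ridge objective is normalized by the full sample size $n$, and the helper names `ridge_fit` and `avg_projection` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam, n_norm):
    """Ridge estimator solving (X^T X / n_norm + lam I) beta = X^T y / n_norm."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X / n_norm + lam * np.eye(p), X.T @ y / n_norm)


def avg_projection(X, y, a, k, lam, M, rng):
    """Monte Carlo estimate of E_W[a^T beta_{W,lam}] over M subsamples of size k.

    Each subsample is drawn without replacement; the normalization by the
    full sample size n is an assumption of this sketch.
    """
    n = X.shape[0]
    vals = []
    for _ in range(M):
        idx = rng.choice(n, size=k, replace=False)
        vals.append(a @ ridge_fit(X[idx], y[idx], lam, n))
    return float(np.mean(vals))


rng = np.random.default_rng(1)
n, p = 300, 30
X = rng.standard_normal((n, p))  # illustrative data, not the paper's setup
beta0 = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta0 + 0.5 * rng.standard_normal(n)
a = rng.standard_normal(p) / np.sqrt(p)  # a ~ N(0, I_p / p), as in the captions

proj_sub = avg_projection(X, y, a, k=150, lam=0.1, M=20, rng=rng)
proj_full = float(a @ ridge_fit(X, y, 0.1, n))
print(f"subsampled projection: {proj_sub:.4f}, full-data projection: {proj_full:.4f}")
```

Evaluating `avg_projection` over a grid of subsample sizes `k` and ridge levels `lam` would give a heatmap of the kind the captions describe; when `k` equals `n`, the subsample is a permutation of the data and the result coincides with the full-data ridge projection.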

Theorems & Definitions (23)

Theorem 1: Implicit regularization of weighted representations
Theorem 2: Regularization paths due to subsampling
Proposition 3: Regularization paths with linear features
Proposition 4: Regularization paths with kernel features
Proposition 5: Regularization paths with random features
Theorem 6: Risk equivalence along the path
Proposition 7: Optimal subsample ratio
Definition 8: Non-commutative algebra
Definition 9: Non-commutative probability space
Definition 10: Moments
...and 13 more
