Implicit Regularization Paths of Weighted Neural Representations

Jin-Hong Du, Pratik Patil

TL;DR

Ridge estimators trained on weighted features along the same equivalence path are asymptotically equivalent when evaluated against test vectors of bounded norm, and this equivalence yields an efficient cross-validation method for tuning.

Abstract

We study the implicit regularization effects induced by (observation) weighting of pretrained features. For weight and feature matrices of bounded operator norms that are infinitesimally free with respect to (normalized) trace functionals, we derive equivalence paths connecting different weighting matrices and ridge regularization levels. Specifically, we show that ridge estimators trained on weighted features along the same path are asymptotically equivalent when evaluated against test vectors of bounded norms. These paths can be interpreted as matching the effective degrees of freedom of ridge estimators fitted with weighted features. For the special case of subsampling without replacement, our results apply to independently sampled random features and kernel features and confirm recent conjectures (Conjectures 7 and 8) of the authors on the existence of such paths in Patil et al. We also present an additive risk decomposition for ensembles of weighted estimators and show that the risks are equivalent along the paths when the ensemble size goes to infinity. As a practical consequence of the path equivalences, we develop an efficient cross-validation method for tuning and apply it to subsampled pretrained representations across several models (e.g., ResNet-50) and datasets (e.g., CIFAR-100).

Paper Structure

This paper contains 36 sections, 12 theorems, 85 equations, 9 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 1

For $\bm{G}_{\bm{I}}\in\mathbb{R}^{n\times n}$, suppose that the subsampling operator $\bm{W}\in\mathbb{R}^{n\times n}$ satisfies \ref{cond:sketch-observation} and that $\limsup \| \bm{y} \|_2^2 / n < \infty$ as $n \to \infty$. For any $\mu > - \liminf_{n\to\infty} \lambda_{\min}^+(\bm{G}_{\bm{I}})$, let $\lambda$ ... where $\mathcal{S}_{\bm{W}^\top \bm{W}}$ is the $S$-transform of the operator $\bm{W}^\top \bm{W}$ ...
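
As background for the $S$-transform named in the theorem, the standard free-probability definition can be worked out for the simplest weighting scheme. This is only a sketch under stated assumptions: it assumes that subsampling $k$ of $n$ observations without replacement makes $\bm{W}^\top \bm{W}$ a diagonal 0/1 matrix with normalized trace $\alpha = k/n$. The symbols $\psi$, $\chi$, $\tau$, and $\alpha$ are introduced here for illustration, and the paper's sign and argument conventions may differ.

```latex
% Moment series, its compositional inverse, and the S-transform of an operator a
% with respect to a normalized trace \tau (standard definitions):
\psi_a(z) = \sum_{m \ge 1} \tau(a^m)\, z^m ,
\qquad
\chi_a = \psi_a^{\langle -1 \rangle} ,
\qquad
\mathcal{S}_a(w) = \frac{1 + w}{w}\, \chi_a(w) .

% Assumed case: a = W^\top W is a projection with \tau(a) = \alpha = k/n,
% so \tau(a^m) = \alpha for every m \ge 1.
\psi_a(z) = \frac{\alpha z}{1 - z}
\;\Longrightarrow\;
\chi_a(w) = \frac{w}{\alpha + w}
\;\Longrightarrow\;
\mathcal{S}_a(w) = \frac{1 + w}{\alpha + w} .
```

With no subsampling ($\alpha = 1$) this gives $\mathcal{S}_a(w) = 1$, the $S$-transform of the identity.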

Figures (9)

  • Figure 1: Equivalence under subsampling. The left panel shows the heatmap of degrees of freedom, and the right panel shows the random projection $\mathbb{E}_{\bm{W}}[\bm{a}^{\top}\widehat{\bm{\beta}}_{\bm{W},\lambda}]$ where $\bm{a}\sim\mathcal{N}({\bm 0}_p,\bm{I}_p/p)$. In both heatmaps, the red lines indicate the paths predicted by \ref{eq:subsample-wor-path}, and the black dashed lines indicate the empirical paths obtained by matching empirical degrees of freedom. The data is generated according to \ref{subsec:simu} with $n=10000$ and $p=1000$, and the results are averaged over $M=100$ random weight matrices $\bm{W}$.
  • Figure 2: Equivalence of degrees of freedom for various feature structures under subsampling. The three panels correspond to linear features, random features with ReLU activation function (2-layer), and kernel features (polynomial kernel with degree 3 and without intercept), respectively. In all heatmaps, the red lines indicate the paths predicted by \ref{eq:subsample-wor-path}, and the black dashed lines indicate the empirical paths obtained by matching empirical degrees of freedom. The data is generated according to \ref{subsec:simu} with $n=5000$ and $p=500$, and the results are averaged over $M=100$ random weight matrices $\bm{W}$.
  • Figure 3: Equivalence for pretrained ResNet-50 features on the Flowers-102 dataset.
  • Figure 4: Risk estimation by corrected and extrapolated generalized cross-validation. The risk estimates are computed based on $M_0=25$ base estimators using \ref{alg:cross-validation} with $\lambda=10^{-3}$.
  • Figure 5: Equivalence under bootstrapping. The left panel shows the heatmap of degrees of freedom, and the right panel shows the random projection $\mathbb{E}_{\bm{W}}[\bm{a}^{\top}\widehat{\bm{\beta}}_{\bm{W},\lambda}]$ where $\bm{a}\sim\mathcal{N}({\bm 0}_p,\bm{I}_p/p)$. In both heatmaps, the red lines indicate the paths predicted by \ref{eq:subsample-wor-path}, and the black dashed lines indicate the empirical paths obtained by matching empirical degrees of freedom. Although the theoretical path for bootstrapping is more complex, the empirical paths closely resemble the predicted ones, so the theoretical path for sampling without replacement from \ref{eq:subsample-wor-path} serves as a good approximation.
  • ...and 4 more figures
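
The "empirical paths obtained by matching empirical degrees of freedom" in the captions above can be illustrated with a small numerical sketch. It is not the authors' code. It assumes Gaussian linear features, a single subsample drawn without replacement, and degrees of freedom defined as the trace of the ridge smoother normalized by the full sample size $n$; the function names and the bisection search are hypothetical choices made for illustration.

```python
import numpy as np


def gram_eigenvalues(features, n_total):
    """Nonzero eigenvalues of features^T features / n_total (assumed normalization)."""
    singular_values = np.linalg.svd(features, compute_uv=False)
    return singular_values**2 / n_total


def df_from_eigs(eigs, ridge, n_total):
    """Normalized degrees of freedom: sum_i e_i / (e_i + ridge), divided by n_total."""
    return float(np.sum(eigs / (eigs + ridge)) / n_total)


def match_ridge_level(sub_eigs, target_df, n_total, upper, lower=1e-10, iters=100):
    """Geometric bisection for the ridge level at which the subsample attains target_df.

    Degrees of freedom decrease as the ridge level grows, so the root is bracketed
    whenever df(lower) >= target_df >= df(upper).
    """
    if df_from_eigs(sub_eigs, lower, n_total) < target_df:
        raise ValueError("target degrees of freedom not reachable with this subsample")
    lo, hi = lower, upper
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if df_from_eigs(sub_eigs, mid, n_total) > target_df:
            lo = mid  # still too many degrees of freedom: regularize more
        else:
            hi = mid
    return float(np.sqrt(lo * hi))


# Illustrative sizes, much smaller than those used in the figures.
rng = np.random.default_rng(0)
n, p, k, mu = 400, 50, 200, 1.0
features = rng.standard_normal((n, p))
rows = rng.choice(n, size=k, replace=False)  # subsample without replacement

full_eigs = gram_eigenvalues(features, n)
sub_eigs = gram_eigenvalues(features[rows], n)

target_df = df_from_eigs(full_eigs, mu, n)  # full data at ridge level mu
lam = match_ridge_level(sub_eigs, target_df, n, upper=mu)
matched_df = df_from_eigs(sub_eigs, lam, n)

print(f"full data: mu={mu:.3f}, df={target_df:.5f}")
print(f"subsample of size {k}: lambda={lam:.3f}, df={matched_df:.5f}")
```

Because dropping rows can only shrink the eigenvalues of the Gram matrix, the matched level on the subsample never exceeds the full-data level, which is why `upper=mu` brackets the search. Repeating the search over a grid of subsample sizes traces out one empirical path.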

Theorems & Definitions (23)

  • Theorem 1: Implicit regularization of weighted representations
  • Theorem 2: Regularization paths due to subsampling
  • Proposition 3: Regularization paths with linear features
  • Proposition 4: Regularization paths with kernel features
  • Proposition 5: Regularization paths with random features
  • Theorem 6: Risk equivalence along the path
  • Proposition 7: Optimal subsample ratio
  • Definition 8: Non-commutative algebra
  • Definition 9: Non-commutative probability space
  • Definition 10: Moments
  • ...and 13 more