Layered Models can "Automatically" Regularize and Discover Low-Dimensional Structures via Feature Learning

Yunlu Chen, Yang Li, Keli Liu, Feng Ruan

TL;DR

The paper investigates a two-layer nonparametric regression model where the input is first linearly projected and then passed through a nonlinear RKHS predictor, jointly optimizing over the first-layer matrix $U$ (or $\Sigma=UU^T$) and the second-layer function $f$. It shows that, under mild conditions, the population minimizer aligns its learned subspace with the central mean subspace $S_*$ and that, for small ridge strength $\lambda$, the learned subspace exactly recovers $S_*$; in finite samples, the empirical minimizers remain low-rank and consistently estimate $S_*$ and its dimension without explicit low-rank penalties. A key technical insight is a sharpness property of the population objective, which, together with uniform convergence of objectives and gradients, ensures that the finite-sample solutions inherit the population’s low-rank structure. The authors also establish an equivalence between the layered formulation and a kernel-learning formulation with a learnable kernel $k_\Sigma$ or $k_q$, demonstrate the necessity of rotationally invariant kernels for the phenomenon, and illustrate the effect with synthetic and real-data experiments (e.g., SVHN and automobile data) that highlight automatic feature learning and dimensionality reduction through the learned bottom layer. The results offer a new perspective on implicit regularization in layered models, with implications for dimension reduction, interpretability, and efficient representation learning without conventional penalties.

Abstract

Layered models like neural networks appear to extract key features from data through empirical risk minimization, yet the theoretical understanding for this process remains unclear. Motivated by these observations, we study a two-layer nonparametric regression model where the input undergoes a linear transformation followed by a nonlinear mapping to predict the output, mirroring the structure of two-layer neural networks. In our model, both layers are optimized jointly through empirical risk minimization, with the nonlinear layer modeled by a reproducing kernel Hilbert space induced by a rotation and translation invariant kernel, regularized by a ridge penalty. Our main result shows that the two-layer model can "automatically" induce regularization and facilitate feature learning. Specifically, the two-layer model promotes dimensionality reduction in the linear layer and identifies a parsimonious subspace of relevant features -- even without applying any norm penalty on the linear layer. Notably, this regularization effect arises directly from the model's layered structure, independent of optimization dynamics. More precisely, assuming the covariates have nonzero explanatory power for the response only through a low dimensional subspace (central mean subspace), the linear layer consistently estimates both the subspace and its dimension. This demonstrates that layered models can inherently discover low-complexity solutions relevant for prediction, without relying on conventional regularization methods. Real-world data experiments further demonstrate the persistence of this phenomenon in practice.
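To make the layered objective concrete, the following is a minimal sketch of the inner problem for a fixed linear layer $U$: kernel ridge regression with an intercept on the projected inputs $U^T X$. The Gaussian kernel is only one example of a rotation and translation invariant kernel, and the $1/2$ scalings, the function names `gaussian_kernel` and `profile_objective`, and the bordered linear system used to solve for the intercept are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def gaussian_kernel(X, U):
    """Gram matrix of k(x, x') = exp(-||U^T x - U^T x'||^2) (assumed kernel)."""
    Z = X @ U                                   # project inputs: n x q
    sq = np.sum(Z ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    return np.exp(-np.maximum(dist2, 0.0))      # clip tiny negative round-off


def profile_objective(X, y, U, lam):
    """Minimize over f = sum_i alpha_i k(x_i, .) and intercept gamma, for fixed U:

        (1 / 2n) * ||y - K alpha - gamma||^2 + (lam / 2) * alpha^T K alpha.

    The first-order conditions reduce to the bordered linear system
        (K + n*lam*I) alpha + gamma * 1 = y,    1^T alpha = 0.
    """
    n = X.shape[0]
    K = gaussian_kernel(X, U)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = K + n * lam * np.eye(n)
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    sol = np.linalg.solve(A, np.append(y, 0.0))
    alpha, gamma = sol[:n], sol[n]
    resid = y - K @ alpha - gamma
    value = 0.5 * np.mean(resid ** 2) + 0.5 * lam * (alpha @ K @ alpha)
    return value, alpha, gamma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((40, 5))
    y = np.tanh(X[:, 0] + X[:, 1]) + 0.1 * rng.standard_normal(40)
    U = np.eye(5)[:, :2]                        # hypothetical 2-column linear layer
    value, alpha, gamma = profile_objective(X, y, U, lam=0.1)
    print(value, gamma)
```

Joint training would then minimize the returned value over $U$ (or $\Sigma = UU^T$), for example by gradient descent; that outer loop is omitted here.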

Paper Structure (66 sections, 37 theorems, 192 equations, 19 figures)

This paper contains 66 sections, 37 theorems, 192 equations, 19 figures.

Key Result

theorem 1

Let Assumptions (assumption:existence-of-central-mean-subspace) through (assumption:regularity-of-h) hold. Fix a unitarily invariant norm $|\!|\!|\cdot|\!|\!|$ and $M \in (0, \infty)$. Then any minimizer $(\Sigma^*, F^*, \gamma^*)$ of the population objective (eqn:kernel-learning-objective-variant-population) ob [...] The result also holds verbatim by replacing (eqn:kernel-learning-objective-variant-population) with e [...]

Figures (19)

  • Figure 1: The covariate $X\sim \mathcal{N}(0,I)$, and the response $Y=F(X)+\epsilon$, where $F(x)=0.1(x_1 + x_2 + x_3)^3 + \tanh(x_1 + x_3 + x_5)$ and $\epsilon\sim \mathcal{N}(0,\sigma^2)$ with $\sigma=0.1$. Here $n = 300$ and $d = 50$. We apply gradient descent to optimize the objective (eqn:kernel-learning-objective). For each of $100$ repeated samples of the pair $(X, Y)$, we record the solution matrix $\Sigma_n^*$ across a range of $\lambda$ values in $(0, 3]$. On the left panel, we plot the empirical probability of ${\rm rank}(\Sigma_n^*) \le 2$ against $\lambda$. On the right panel, we display how the rank of the solution $\Sigma_n^*$ changes with different $\lambda$ values, using $5$ examples of $(X, Y)$. Both plots suggest a remarkable inclination of the solution matrix $\Sigma_n^*$ towards low-rankness, even though we don't apply nuclear norm penalties to $\Sigma$ in (eqn:kernel-learning-objective) and don't use early stopping techniques.
  • Figure 2: Plots for Claim (i). For each row, the left panel shows the empirical probability of ${\rm rank}(\Sigma_n^*) \le \dim(S_*)$ over $100$ repeated experiments for different $\lambda$ values, while the right panel illustrates the rank of $\Sigma_n^*$ as $\lambda$ varies, using $5$ example pairs of $(X, y)$.
  • Figure 3: Cross-validation error (green curve) and rank of the bottom-layer weight matrix $U$ (orange curve) versus $\lambda$, for $\lambda \in \{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1\}$. The plot reveals the emergence of a low-rank phenomenon. Additionally, the standard error of the cross-validation error estimate is shown to facilitate model selection using the "one-standard-error" rule [HastieTiFr09].
  • Figure 4: Plot of $\log(y)$ versus $x$ projected on top singular vectors of Automobile data. Red points represent luxury manufacturers (Mercedes-Benz, BMW, Jaguar, and Porsche) and blue points represent economy brands (Honda, Chevrolet, Plymouth, and Subaru). Black points are unclassified, as their manufacturer information is unavailable.
  • Figure 5: Sample images of 8 and 0 from SVHN data.
  • ...and 14 more figures
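
The synthetic setup described in the Figure 1 caption can be sketched as follows. The data-generating formula, $n = 300$, $d = 50$, and $\sigma = 0.1$ come from the caption; the function names, the seed handling, and the relative eigenvalue tolerance used to count rank are assumptions made for illustration. Under this model the regression function depends on $x$ only through the two directions $x_1 + x_2 + x_3$ and $x_1 + x_3 + x_5$, which is consistent with the rank threshold of $2$ used in the caption.

```python
import numpy as np


def simulate_figure1(n=300, d=50, sigma=0.1, seed=0):
    """Draw (X, y) with X ~ N(0, I_d) and y = F(X) + eps, eps ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    # Indices are 0-based: x_1 is X[:, 0], x_5 is X[:, 4].
    F = 0.1 * (X[:, 0] + X[:, 1] + X[:, 2]) ** 3 + np.tanh(X[:, 0] + X[:, 2] + X[:, 4])
    y = F + sigma * rng.standard_normal(n)
    return X, y


def true_directions(d=50):
    """d x 2 matrix whose columns are the two directions F depends on."""
    B = np.zeros((d, 2))
    B[[0, 1, 2], 0] = 1.0    # x_1 + x_2 + x_3
    B[[0, 2, 4], 1] = 1.0    # x_1 + x_3 + x_5
    return B


def numerical_rank(Sigma, rel_tol=1e-8):
    """Count eigenvalues of a symmetric PSD matrix above rel_tol * largest (assumed rule)."""
    eigvals = np.linalg.eigvalsh(Sigma)
    top = eigvals.max()
    if top <= 0.0:
        return 0
    return int(np.sum(eigvals > rel_tol * top))


if __name__ == "__main__":
    X, y = simulate_figure1()
    B = true_directions()
    print(X.shape, y.shape, numerical_rank(B @ B.T))
```

A fitted matrix $\Sigma_n^*$ from any solver could be passed to `numerical_rank` to reproduce the kind of rank-versus-$\lambda$ curve the caption describes; the reported rank depends on the tolerance chosen.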

Theorems & Definitions (70)

  • definition 1: Central Mean Subspace CookLi02
  • theorem 1
  • theorem 2
  • theorem 3
  • lemma 1: Existence and Uniqueness of the Minimizer CuckerSm02
  • lemma 2: Euler-Lagrange
  • lemma 3: Trivial Upper Bound
  • proof
  • lemma 4
  • proof
  • ...and 60 more