Layered Models can "Automatically" Regularize and Discover Low-Dimensional Structures via Feature Learning
Yunlu Chen, Yang Li, Keli Liu, Feng Ruan
TL;DR
The paper studies a two-layer nonparametric regression model in which the input is first linearly projected and then passed through a nonlinear RKHS predictor; the first-layer matrix $U$ (equivalently $\Sigma=UU^T$) and the second-layer function $f$ are optimized jointly. It shows that, under mild conditions, the subspace learned by the population minimizer aligns with the central mean subspace $S_*$ and that, for small ridge strength $\lambda$, the learned subspace exactly recovers $S_*$. In finite samples, the empirical minimizers remain low-rank and consistently estimate $S_*$ and its dimension without any explicit low-rank penalty. The key technical insight is a sharpness property of the population objective which, together with uniform convergence of the objectives and their gradients, ensures that finite-sample solutions inherit the low-rank structure of the population solution. The authors also establish an equivalence between the layered formulation and a kernel-learning formulation with a learnable kernel $k_\Sigma$ or $k_q$, show that rotationally invariant kernels are necessary for the phenomenon, and illustrate the effect with synthetic and real-data experiments (e.g., SVHN and automobile data) in which the learned bottom layer performs feature learning and dimensionality reduction automatically. The results offer a new perspective on implicit regularization in layered models, with implications for dimension reduction, interpretability, and efficient representation learning without conventional penalties.
Abstract
Layered models like neural networks appear to extract key features from data through empirical risk minimization, yet the theoretical understanding of this process remains unclear. Motivated by these observations, we study a two-layer nonparametric regression model where the input undergoes a linear transformation followed by a nonlinear mapping to predict the output, mirroring the structure of two-layer neural networks. In our model, both layers are optimized jointly through empirical risk minimization, with the nonlinear layer modeled by a reproducing kernel Hilbert space induced by a rotation- and translation-invariant kernel, regularized by a ridge penalty. Our main result shows that the two-layer model can "automatically" induce regularization and facilitate feature learning. Specifically, the two-layer model promotes dimensionality reduction in the linear layer and identifies a parsimonious subspace of relevant features, even without applying any norm penalty on the linear layer. Notably, this regularization effect arises directly from the model's layered structure, independent of optimization dynamics. More precisely, assuming the covariates have nonzero explanatory power for the response only through a low-dimensional subspace (the central mean subspace), the linear layer consistently estimates both the subspace and its dimension. This demonstrates that layered models can inherently discover low-complexity solutions relevant for prediction, without relying on conventional regularization methods. Real-world data experiments further demonstrate the persistence of this phenomenon in practice.
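
To make the jointly optimized objective concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a Gaussian kernel as one instance of a rotation- and translation-invariant kernel, drops the intercept by centering the response, and uses the normalization $\frac{1}{2n}\sum_i (y_i - f(U^T x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$; these choices, along with the names `gaussian_kernel`, `profiled_objective`, and `lam`, are illustrative assumptions. For a fixed first layer $U$, the minimization over $f$ is ordinary kernel ridge regression and has a closed form, so the objective can be evaluated as a function of $U$ alone.

```python
import numpy as np

def gaussian_kernel(Z):
    """Gram matrix exp(-||z_i - z_j||^2) for the rows of Z, shape (n, r)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0))  # clip tiny negatives from round-off

def profiled_objective(U, X, y, lam):
    """Value of min_f (1/2n) sum_i (y_i - f(U^T x_i))^2 + (lam/2) ||f||_H^2.

    For fixed U this is kernel ridge regression on the projected inputs X @ U.
    With alpha = (K + n*lam*I)^{-1} y, the minimum equals (lam/2) * y^T alpha.
    Assumption: no intercept term, so y should be centered beforehand.
    """
    n = X.shape[0]
    K = gaussian_kernel(X @ U)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return 0.5 * lam * float(y @ alpha)

# Synthetic example (illustrative): the response depends on X only through
# its first coordinate, so the relevant subspace is spanned by e_1.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
y = y - y.mean()

U_relevant = np.eye(d)[:, [0]]    # projects onto the relevant coordinate
U_irrelevant = np.eye(d)[:, [1]]  # projects onto an irrelevant coordinate

print("objective, relevant direction:  ", profiled_objective(U_relevant, X, y, 0.01))
print("objective, irrelevant direction:", profiled_objective(U_irrelevant, X, y, 0.01))
```

Comparing the two printed values shows how the objective separates a projection that captures the signal from one that does not. A full procedure would additionally minimize this quantity over $U$ (or $\Sigma=UU^T$), which the sketch leaves out.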
