ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models

Suzanna Parkinson, Greg Ongie, Rebecca Willett

TL;DR

This work analyzes how ReLU networks with multiple linear layers on the input side impart an inductive bias via the representation cost $R_L(f)$, steering interpolants toward functions with latent low-dimensional structure. By relating $R_L$ to Schatten quasi-norm regularization on the inner weight matrix and to the expected gradient outer product (EGOP), the authors connect depth-induced bias to single- and multi-index models and quantify it through index rank and mixed variation. They establish theoretical bounds showing that minimal-$R_L$ interpolants increasingly favor low index rank as the depth $L$ grows, and provide finite-sample bounds on the effective index rank of interpolants, along with empirical evidence that adding linear layers improves generalization and aligns learned subspaces with the true latent subspace. The findings suggest that linear input layers act as a regularizer that promotes low-rank structure and latent subspace alignment, with practical implications for improving generalization in moderate-sample regimes. The work also outlines limitations and directions for extending these ideas to deeper nonlinear architectures.

Abstract

Neural networks often operate in the overparameterized regime, in which there are far more parameters than training samples, allowing the training data to be fit perfectly. That is, training the network effectively learns an interpolating function, and properties of the interpolant affect predictions the network will make on new samples. This manuscript explores the properties of such functions when they are learned by neural networks with more than two layers. Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs. The representation cost of a function induced by a neural network architecture is the minimum sum of squared weights needed for the network to represent the function; it reflects the function space bias associated with the architecture. Our results show that adding linear layers to the input side of a shallow ReLU network yields a representation cost favoring functions with low mixed variation -- that is, functions that have limited variation in directions orthogonal to a low-dimensional subspace and can be well approximated by a single- or multi-index model. This bias occurs because minimizing the sum of squared weights of the linear layers is equivalent to minimizing a low-rank-promoting Schatten quasi-norm of a single "virtual" weight matrix. Our experiments confirm this behavior in standard network training regimes. They additionally show that linear layers can improve generalization and that the learned network is well aligned with the true latent low-dimensional linear subspace when data is generated using a multi-index model.

Paper Structure (44 sections, 22 theorems, 129 equations, 17 figures)

Key Result

Lemma 2.1

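Suppose $f\in \mathcal{N}_2\left(\mathcal{X}\right)$. Then
$$R_L(f) = \inf_{\theta \in \Theta_2} \frac{1}{L}\|{\bm{a}}\|_2^2 + \frac{L-1}{L}\|{\bm{W}}\|_{{\mathcal{S}}^q}^q \quad \text{s.t.} \quad f = h_\theta^{(2)}|_{\mathcal{X}},$$
where $q:=2/(L-1)$ and $\|{\bm{W}}\|_{{\mathcal{S}}^q}$ is the Schatten-$q$ quasi-norm, i.e., the $\ell^q$ quasi-norm of the singular values of ${\bm{W}}$. Here ${\bm{W}}$ and ${\bm{a}}$ denote the inner and outer weights of the two-layer ReLU network $h_\theta^{(2)}$.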
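The link between squared weights and the Schatten quasi-norm can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper names (`schatten_q`, `balanced_factors`, `product`, `sum_sq`) and the random test matrix are assumptions made for illustration. It factors a matrix ${\bm{W}}$ into $L-1$ "balanced" factors, each carrying the $(L-1)$-th root of the singular values, and shows that their total squared Frobenius norm equals $(L-1)\|{\bm{W}}\|_{{\mathcal{S}}^q}^q$ with $q = 2/(L-1)$. The sketch only exhibits a factorization that attains this value; it does not prove that no cheaper factorization exists.

```python
import numpy as np


def schatten_q(W, q):
    """Schatten-q quasi-norm: the l^q (quasi-)norm of the singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s ** q) ** (1.0 / q)


def balanced_factors(W, k):
    """Split W into k factors, W = W_k @ ... @ W_1, sharing singular values evenly.

    Hypothetical helper: returns the list [W_1, ..., W_k].
    """
    if k == 1:
        return [W]
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.diag(s ** (1.0 / k))  # each factor carries the k-th root of s
    return [root @ Vt] + [root] * (k - 2) + [U @ root]


def product(factors):
    """Multiply the factors back together in the order W_k @ ... @ W_1."""
    out = factors[0]
    for F in factors[1:]:
        out = F @ out
    return out


def sum_sq(factors):
    """Sum of squared entries over all factors (the weight-decay penalty)."""
    return sum(float(np.sum(F ** 2)) for F in factors)


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # arbitrary test matrix (assumption)
for L in range(2, 6):
    k = L - 1          # number of linear layers
    q = 2.0 / k        # Schatten exponent from the lemma
    factors = balanced_factors(W, k)
    # The last two printed columns should agree for every depth L.
    print(L, q, sum_sq(factors), k * schatten_q(W, q) ** q)
```

For $L = 2$ the exponent is $q = 2$ and the penalty reduces to the squared Frobenius norm; as $L$ grows, $q$ shrinks toward zero and the penalty weights small singular values more heavily relative to large ones, which is the low-rank-promoting effect described in the abstract.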

Figures (17)

  • Figure 1: Numerical evidence that weight decay promotes unit alignment with more linear layers. Neural networks with $L-1$ linear layers followed by one ReLU layer were trained using SGD with $\ell_2$-regularization (weight decay) to near-zero training loss on the training samples, which are shown in black. Panels (a)-(c) show the resulting interpolating functions as surface plots. Our theory predicts that as the number of linear layers increases, the learned interpolating function will become closer to constant in directions orthogonal to a low-dimensional subspace on which a parsimonious interpolant can be defined.
  • Figure 2: Illustration of learning a low-index-rank function. (a) Heatmap of a rank-1 data generating function $f:\mathbb{R}^2 \rightarrow \mathbb{R}$ and locations of training samples. (b) Interpolant learned with $L=2$ layers, which does not exhibit index-rank-1 structure. (c) Interpolant learned with $L=4$ layers, which closely approximates the index-rank-1 structure of the data-generating function. (d) Result of performing PCA on training features to reduce their dimension to one, followed by learning with $L=2$ layers. Because the PCA subspace depends on the geometry of the training features and not on the geometry of the function, PCA cannot discover the correct principal subspace.
  • Figure 3: Illustration of four functions $f:\mathbb{R}^2 \rightarrow \mathbb{R}$ with mixed variation (Definition 3.2) decreasing from left to right. All four functions have index rank 2 according to Definition 3.1, but the functions on the right with smaller mixed variation are closer to having index rank 1 because they vary significantly more in one direction than another.
  • Figure 4: Existence of rank deficient interpolants. Left panel shows 32 training samples generated by the index-rank-2 function $f^*(x_1,x_2) = [x_1]_+-[x_2]_+$, for which $R_2(f^*) = 2$. Middle panel shows $f_{1}$, the estimated minimal $R_2$-cost index-rank-one interpolant of the training samples, for which $R_2(f_1) \approx 287.5$. Right panel shows the 1D profile of the rank-one interpolant in the middle panel.
  • Figure 5: Adding linear layers improves generalization on multi-index models. In-distribution generalization performance of networks trained with or without extra linear layers on data from a single-index model (left) or multi-index model (center, right) with varying amounts of label noise. Models trained with extra linear layers demonstrate significantly improved generalization in this setting. (Bottom) Even in the presence of label noise ($\sigma > 0$), the generalization error of models with extra linear layers quickly approaches the irreducible error $\sigma^2$ as the number of training samples ($n$) increases. See the experiments section and the appendix on experiment details for training details.
  • ...and 12 more figures
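
The contrast drawn in the Figure 2 caption between the geometry of the features and the geometry of the function can be illustrated with the expected gradient outer product (EGOP). The sketch below is an illustration under stated assumptions, not the paper's experiment: the single-index function $f(x) = \tanh(v^\top x)$, the direction `v`, the anisotropic Gaussian features, and the helper `egop` are all hypothetical choices. Because every gradient of such a function is parallel to $v$, the EGOP has rank one and its top eigenvector recovers $\pm v$, whereas the top PCA direction follows the feature variance instead.

```python
import numpy as np


def egop(grad_f, X):
    """Monte Carlo estimate of E[grad f(x) grad f(x)^T] over the rows of X."""
    G = np.stack([grad_f(x) for x in X])
    return G.T @ G / len(X)


# Hypothetical single-index function f(x) = tanh(v . x) with unit direction v.
v = np.array([0.8, 0.6])


def grad_f(x):
    # Analytic gradient: (1 - tanh^2(v . x)) * v, always parallel to v.
    return (1.0 - np.tanh(v @ x) ** 2) * v


rng = np.random.default_rng(0)
# Anisotropic features: most of the feature variance lies along the second axis.
X = rng.standard_normal((2000, 2)) * np.array([1.0, 4.0])

# Top eigenvector of the EGOP (np.linalg.eigh returns ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(egop(grad_f, X))
top_egop = eigvecs[:, -1]

# Numerical rank with an explicit relative tolerance.
egop_rank = int(np.sum(eigvals > 1e-8 * eigvals[-1]))

# Top PCA direction of the centered features.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
top_pca = Vt[0]

print("EGOP numerical rank:", egop_rank)
print("|cos(EGOP direction, v)|:", abs(top_egop @ v))
print("|cos(PCA direction, v)|:", abs(top_pca @ v))
```

In this toy setting the gradient is known in closed form. For a trained network one would instead need gradients of the learned function, which this sketch does not attempt.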

Theorems & Definitions (47)

  • Lemma 2.1
  • proof
  • Definition 3.1: Index rank
  • Definition 3.2: Mixed variation
  • Theorem 4.1
  • Corollary 4.2
  • Corollary 4.3
  • Definition 4.4: Effective index rank
  • Definition 4.5: Interpolation cost
  • Theorem 4.6: Effective index ranks of minimal-cost interpolants.
  • ...and 37 more