The Low-Rank Simplicity Bias in Deep Networks

Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, Phillip Isola

TL;DR

This paper investigates why over-parameterized deep networks generalize well by identifying a low-rank simplicity bias: deeper networks preferentially map data to low effective-rank embeddings. It introduces effective rank as a spectral-entropy measure of the embedding Gram matrix and shows, across linear and nonlinear models, that depth biases the learned representations toward lower rank, observable both at initialization and after training and across various optimizers. The authors connect these empirical findings to random matrix theory in the linear case and demonstrate that linearly over-parameterizing networks can induce a beneficial low-rank bias, improving generalization on CIFAR and ImageNet without increasing modeling capacity. They discuss residual connections, the scope of the bias beyond gradient-based optimization, and the broader implications for architectural design and regularization. Overall, the work highlights parameterization, not just optimization, as a key factor shaping the inductive bias of deep networks toward simpler embeddings with practical generalization benefits.

Abstract

Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate and extend the hypothesis that deeper networks are inductively biased to find solutions with lower effective rank embeddings. We conjecture that this bias exists because the volume of functions that maps to low effective rank embedding increases with depth. We show empirically that our claim holds true on finite width linear and non-linear models on practical learning paradigms and show that on natural data, these are often the solutions that generalize well. We then show that the simplicity bias exists at both initialization and after training and is resilient to hyper-parameters and learning methods. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce low-rank bias, improving generalization performance on CIFAR and ImageNet without changing the modeling capacity.
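A minimal sketch of why linear over-parameterization leaves modeling capacity unchanged, assuming the simplest case of a single linear layer expanded into two stacked linear maps with no non-linearity between them. The shapes, the names `W1`, `W2`, `W_collapsed`, and the Gaussian initialization are illustrative assumptions, not the authors' training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 32, 4  # hypothetical sizes chosen for illustration

# Expanded (training-time) form: two linear maps applied back to back.
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
W2 = rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden)

def expanded_forward(x):
    """Apply the over-parameterized layer: x -> W2 (W1 x)."""
    return (x @ W1.T) @ W2.T

# Collapsed (inference-time) form: one matrix of the original layer's shape.
W_collapsed = W2 @ W1

def collapsed_forward(x):
    """Apply the equivalent single linear layer."""
    return x @ W_collapsed.T

x = rng.normal(size=(16, d_in))
max_gap = float(np.max(np.abs(expanded_forward(x) - collapsed_forward(x))))
print("largest output difference:", max_gap)
```

Both forms compute the same function up to floating-point error, so the expansion changes only how the layer is parameterized during training, not what it can represent.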

Paper Structure

This paper contains 28 sections, 1 theorem, 25 equations, 24 figures, 1 table.

Key Result

Theorem 3.1

Let $\rho$ be the effective rank measure defined in Definition 2.1. For a linear neural network with $d$ layers, where the parameters are drawn from the same Normal distribution $\{ W_i \}_{i=1}^d\sim \mathcal{W}$, the effective rank of the weights monotonically decreases when increasing the number of layers ...

Proof. See Appendix app:erank.
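The trend stated in the theorem can be checked numerically with a small Monte Carlo sketch. It assumes effective rank is the Shannon entropy of the normalized singular values (without exponentiation), square Gaussian weight matrices, and illustrative values for `width` and `n_samples`. None of these choices are taken from the paper's experiments.

```python
import numpy as np

def effective_rank(matrix):
    """Shannon entropy of the normalized singular values (assumed definition)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # skip exact zeros so log is defined
    return float(-(p * np.log(p)).sum())

def mean_effective_rank(depth, width=32, n_samples=50, seed=0):
    """Average effective rank of a product of `depth` random Gaussian matrices."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_samples):
        product = np.eye(width)
        for _ in range(depth):
            layer = rng.normal(size=(width, width)) / np.sqrt(width)
            product = layer @ product
        values.append(effective_rank(product))
    return float(np.mean(values))

depths = [1, 2, 4, 8]
mean_ranks = [mean_effective_rank(d) for d in depths]
for d, r in zip(depths, mean_ranks):
    print(f"depth={d}: mean effective rank (entropy) = {r:.3f}")
```

Because the singular values are normalized before the entropy is taken, the overall scale of the weights does not affect the result; only the shape of the spectrum does.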

Figures (24)

  • Figure 1: Deep nets struggle to fit high-rank linear functions: We report the training loss of neural networks of different depths optimized to solve linear regression. The rank of the underlying linear function is varied in the range $[1, 64]$. While shallow networks achieve zero training loss, the training loss worsens with increased depth and task rank (see Appendix \ref{app:train} for training details).
  • Figure 2: Deep networks are biased toward low effective rank: We approximate the probability density function (PDF) of the effective rank $\rho$ of the Gram matrix computed from the networks' features. The Gram matrix is computed with $256$ random inputs, and we use $4096$ network parameter samples to approximate the cumulative distribution function (CDF). The CDF is used to compute the PDF via the finite difference method. We apply a Savitzky-Golay filter (savitzky1964smoothing) to smooth the approximation. Adding more layers shifts probability mass toward lower effective rank embeddings. The experiment is repeated for both normal and uniform distributions. For linear networks, the effective parameters are fixed across depth, while for non-linear networks, this is not the case.
  • Figure 3: Distribution of non-linear nets at convergence: Rank distribution after training the network to zero training error with gradient descent. The dotted line indicates the initial distribution, the solid line indicates the converged distribution, and the green line indicates the task rank. Although all models have the same functional capacity, a model's ability to find the underlying solution depends on the original parameterization of the network. Even though every model achieves zero training error, models of different depths recover different underlying solutions. In this experiment, the model with a depth of $4$ or $8$ finds a better generalizing solution on a held-out set than models with more or fewer layers.
  • Figure 4: Gram matrices of networks: Gram matrices of neural networks trained with various non-linearities and depth. Since increasing the number of non-linear layers increases the functional expressivity of the network, the Gram matrix is computed using the cosine distance on the features of the test set near zero-training loss. Increasing the number of layers decreases the effective rank of the Gram matrix on a variety of non-linear activation functions. The Gram matrix is hierarchically clustered (rokach2005clustering) for visualization. We observe the emergence of block structures in the Gram matrix as we increase the number of layers, indicating that the embeddings become lower rank with depth.
  • Figure 5: Low-rank bias & optimizers: Linear neural networks trained on a least-squares objective using various optimization methods. The rank of the converged Gram matrix is correlated with the depth of the network. The experiment is repeated 5 times. Except for $\mathsf{Random\;Search}$, all models achieve $0$ training loss. While the solution achieved depends on the optimizer, the underlying low-rank bias of depth persists across optimizers and is not specific to gradient descent. All models have the same functional expressivity.
  • ...and 19 more figures
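The density-estimation procedure described in the Figure 2 caption (sample parameters, compute the effective rank of the feature Gram matrix, build an empirical CDF, differentiate it, smooth) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: it uses linear networks only, far fewer inputs and parameter samples than the $256$ and $4096$ in the caption, the entropy form of effective rank, and a plain moving average as a stand-in for the Savitzky-Golay filter.

```python
import numpy as np

def effective_rank(matrix):
    """Shannon entropy of the normalized singular values (assumed definition)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def sample_gram_eranks(depth, width=16, n_inputs=64, n_samples=200, seed=0):
    """Effective rank of the feature Gram matrix for random linear networks."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_inputs, width))  # fixed random inputs
    values = np.empty(n_samples)
    for i in range(n_samples):
        z = x
        for _ in range(depth):
            z = z @ (rng.normal(size=(width, width)) / np.sqrt(width))
        values[i] = effective_rank(z @ z.T)  # Gram matrix of the embeddings
    return values

def density_from_cdf(samples, grid, window=9):
    """Empirical CDF -> finite-difference PDF -> moving-average smoothing."""
    cdf = np.searchsorted(np.sort(samples), grid, side="right") / len(samples)
    pdf = np.diff(cdf) / np.diff(grid)
    kernel = np.ones(window) / window  # stand-in for a Savitzky-Golay filter
    return np.convolve(pdf, kernel, mode="same")

# Grid padded on both sides so smoothing does not push mass off the edges.
grid = np.linspace(-0.5, np.log(64) + 0.5, 201)
dx = grid[1] - grid[0]
eranks = {d: sample_gram_eranks(d) for d in (2, 8)}
pdfs = {d: density_from_cdf(v, grid) for d, v in eranks.items()}
for d in eranks:
    print(f"depth={d}: mean effective rank = {eranks[d].mean():.3f}")
```

Plotting `pdfs[d]` against the bin midpoints of `grid` gives curves comparable in spirit to the figure, with the deeper network's mass sitting at lower effective rank.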

Theorems & Definitions (8)

  • Definition 2.1: Effective rank
  • Conjecture 3.1
  • Theorem 3.1
  • Definition D.1: Effective rank
  • Definition D.2: Threshold rank
  • Definition D.3: Stable rank
  • Definition D.4: Nuclear norm
  • Definition G.1: Differential effective rank
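For orientation, the rank measures named above can be computed from a matrix's singular values as in the sketch below. The formulas are the commonly used ones and are assumptions here, since the definitions themselves are not reproduced in this overview: effective rank as the entropy of the normalized singular values (optionally exponentiated), threshold rank as the count of normalized singular values at or above a cutoff `tau`, stable rank as the squared Frobenius norm over the squared spectral norm, and nuclear norm as the sum of singular values. Differential effective rank is omitted because its definition is not given here.

```python
import numpy as np

def singular_values(a):
    """Singular values of `a` in descending order."""
    return np.linalg.svd(np.asarray(a, dtype=float), compute_uv=False)

def effective_rank(a, exponentiate=False):
    """Entropy of normalized singular values; exp(entropy) if requested."""
    s = singular_values(a)
    p = s / s.sum()
    p = p[p > 0]
    entropy = float(-(p * np.log(p)).sum())
    return float(np.exp(entropy)) if exponentiate else entropy

def threshold_rank(a, tau):
    """Number of normalized singular values that are at least `tau`."""
    s = singular_values(a)
    return int(np.sum(s / s.sum() >= tau))

def stable_rank(a):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = singular_values(a)
    return float((s ** 2).sum() / (s[0] ** 2))

def nuclear_norm(a):
    """Sum of singular values."""
    return float(singular_values(a).sum())

identity = np.eye(4)
rank_one = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])
print(effective_rank(identity, exponentiate=True), stable_rank(rank_one))
```

All four measures agree with the ordinary rank on the identity matrix and collapse toward one on a rank-one matrix, but they differ in how smoothly they respond to small singular values.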