How Neural Networks Learn the Support is an Implicit Regularization Effect of SGD

Pierfrancesco Beneventano; Andrea Pinto; Tomaso Poggio

TL;DR

This work investigates how optimization dynamics influence the identification of the target function's support in neural networks. Through theoretical analysis of linear and diagonal networks under gradient descent and mini-batch SGD, the authors show that SGD induces a second-order implicit regularization that scales with $\eta/b$ (step size over batch size). This regularization shrinks the first-layer weights associated with irrelevant inputs and enables early identification of the target support; GD without explicit regularization, in contrast, often fails to localize the support in the first layer. The authors decompose inputs into relevant and irrelevant directions and demonstrate two-phase training dynamics: Phase 1 minimizes the loss and learns the target, and Phase 2, driven by implicit SGD regularization, aligns the first layer with the support, especially under oscillatory training conditions. Empirical results on synthetic targets and standard datasets (e.g., MNIST, CIFAR-10) corroborate the theory and indicate that smaller batch sizes can enhance feature interpretability and reduce sensitivity to initialization. Extensions to nonlinear activations, notably ReLU, are discussed, and weight decay is shown to accelerate the suppression of irrelevant weights without harming training.

Abstract

We investigate the ability of deep neural networks to identify the support of the target function. Our findings reveal that mini-batch SGD effectively learns the support in the first layer of the network by shrinking to zero the weights associated with irrelevant components of the input. In contrast, we demonstrate that while vanilla GD also approximates the target function, it requires an explicit regularization term to learn the support in the first layer. We prove that this property of mini-batch SGD is due to a second-order implicit regularization effect that is proportional to $\eta/b$ (step size / batch size). Our results not only provide further evidence that implicit regularization has a significant impact on training dynamics, but also shed light on the structure of the features learned by the network. Additionally, they suggest that smaller batches enhance feature interpretability and reduce dependence on initialization.
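
The comparison described in the abstract can be explored with a small experiment. The following is a minimal sketch, not the authors' code: it trains a two-layer linear network with MSE loss and records the Frobenius norm of the first-layer columns attached to irrelevant inputs, $\mathbf{W}_1[:, r+1:]$. The function name `train_two_layer_linear`, the Gaussian inputs, the noiseless linear target, and every hyperparameter value are assumptions chosen for illustration. Setting the batch size equal to the dataset size recovers full-batch GD.

```python
import numpy as np


def train_two_layer_linear(batch_size, steps=2000, lr=0.01, d=10, r=3,
                           m=16, n=256, init_scale=0.3, seed=0):
    """Train f(x) = w2^T W1 x with mini-batch SGD on an MSE loss.

    Hypothetical setup (not taken from the paper): inputs are standard
    Gaussian and the target is the sum of the first r coordinates, so
    the remaining d - r input coordinates are irrelevant.
    Returns the Frobenius norm of W1[:, r:] after every step.
    """
    rng = np.random.default_rng(seed)
    # Data and initialization are drawn before any batch sampling, so they
    # are identical across batch sizes for a fixed seed.
    X = rng.standard_normal((n, d))
    y = X[:, :r].sum(axis=1)
    W1 = init_scale * rng.standard_normal((m, d))
    w2 = init_scale * rng.standard_normal(m)

    history = np.empty(steps + 1)
    history[0] = np.linalg.norm(W1[:, r:])
    for t in range(steps):
        # Sampling without replacement; batch_size == n gives full-batch GD.
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        hidden = Xb @ W1.T                  # shape (batch, m)
        err = hidden @ w2 - yb              # residuals, shape (batch,)
        # Gradients of 0.5 * mean(err ** 2) with respect to w2 and W1.
        grad_w2 = hidden.T @ err / batch_size
        grad_W1 = np.outer(w2, Xb.T @ err) / batch_size
        W1 -= lr * grad_W1
        w2 -= lr * grad_w2
        history[t + 1] = np.linalg.norm(W1[:, r:])
    return history


if __name__ == "__main__":
    for b in (256, 32, 8):  # 256 equals n, i.e. full-batch GD
        h = train_two_layer_linear(batch_size=b)
        print(f"batch size {b:3d}: irrelevant-weight norm "
              f"{h[0]:.4f} -> {h[-1]:.4f}")
```

Plotting the returned history for several batch sizes gives a curve in the spirit of the paper's "norm of irrelevant weights over time" figure. How strongly the norms separate in this toy setting depends on the assumed step size, label noise, and training length, so the sketch should be read as a starting point for experimentation rather than a reproduction of the reported results.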

Paper Structure (87 sections, 15 theorems, 94 equations, 19 figures)

Introduction
The problem
Our contribution
Theoretical Contributions.
Empirical experiments and appendix.
Background
The targets are inherently low-dimensional.
Finding the support.
SGD implicitly induces regularization.
Identification of the support in the first layer
Convergence: only with mini-batch SGD
GD does not learn the support in the first layer
SGD does it because of its implicit bias
The reason is implicit SGD regularization
The 2 phases of the dynamics
...and 72 more sections

Key Result

Theorem 1

Let $\sigma$ be the identity, let the loss be MSE, and let Assumption 1 hold. For every entry $i,j$ of the first-layer weights corresponding to the irrelevant components, $\mathbf{W}_1[:, r+1:]$, there exist $a,c \geq 0$ such that one step of GD or of mini-batch SGD with batch size $b \in \mathbb{N}$,

Figures (19)

Figure 1: First-layer weights and their Gram matrices at initialization and at convergence for GD and SGD with different batch sizes. All networks are trained from the same initialization on the target $y(\mathbf{x}) = \sin (\sum_{i<r} x_i )$ with sparse support $r < d$. While all networks achieve similar loss (in brackets on the right), SGD with smaller batches identifies the support in the first-layer weights.
Figure 2: Histogram of the eigenvalues of the first-layer weights $\mathbf{W}_1$ on the MNIST dataset.
Figure 3: Norm of irrelevant weights of $\mathbf{W}_1$ over time in a linear network.
Figure 4: Convergence of $a,b$ in Illustrative Example of Implicit SGD Regularization.
Figure 5: Where the MLP model looks on the MNIST dataset.
...and 14 more figures

Theorems & Definitions (17)

Theorem 1
Proposition 1
Proposition 2: Corollary of Theorem \ref{theo:time}
Theorem 2
Proposition 3
Proposition 4
Remark 1
Proposition 5
Proposition 6: Negative result for GD, 2
Remark 2
...and 7 more
