Benignity of loss landscape with weight decay requires both large overparametrization and initialization

Etienne Boursier, Matthew Bowditch, Matthias Englert, Ranko Lazic

TL;DR

This work analyzes the loss landscape of $\ell_2$-regularized two-layer ReLU networks, showing that benignity (absence of spurious local minima) emerges only under substantial overparametrization, namely $m = \Omega\left(\min(n^d,2^n)\right)$ (up to a $\log(n/\varepsilon)$ factor). Using activation-pattern geometry and a convex reformulation, the authors prove that, above this width, almost all activation cones contain a global minimum and no bad local minima for any $\lambda>0$. The case of orthogonal data shows that this threshold is also necessary. They then connect landscapes to optimization dynamics, demonstrating that benign landscapes mainly inform training behavior in the large initialization (NTK-like) regime, while training from small initializations can still converge to bad stationary points due to the implicit bias of optimization. Experiments corroborate the theoretical predictions, highlighting the crucial roles of initialization scale and data geometry. Overall, the findings illuminate why regularization raises the required level of overparametrization and when landscape-based guarantees translate into actual optimization performance.

Abstract

The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the $\ell_2$-regularized training loss for two-layer ReLU networks. We show that the landscape becomes benign -- i.e., free of spurious local minima -- under large overparametrization, specifically when the network width $m$ satisfies $m \gtrsim \min(n^d, 2^n)$, where $n$ is the number of data points and $d$ the input dimension. More precisely in this regime, almost all constant activation regions contain a global minimum and no spurious local minima. We further show that this level of overparametrization is not only sufficient but also necessary via the example of orthogonal data. Finally, we demonstrate that such loss landscape results primarily hold relevance in the large initialization regime. In contrast, for small initializations -- corresponding to the feature learning regime -- optimization can still converge to spurious local minima, despite the global benignity of the landscape.
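
To make the objects in the abstract concrete, the following is a minimal sketch in Python (NumPy) of a regularized training loss and of the activation pattern of a two-layer ReLU network. The exact normalization of the loss (the $\frac{1}{2}$ factors, squared error, no bias terms) is an assumption made for illustration and is not taken from the paper; the function names `regularized_loss` and `activation_pattern` are hypothetical.

```python
import numpy as np

def regularized_loss(W, a, X, y, lam):
    """Assumed l2-regularized squared loss of a two-layer ReLU network.

    W: (m, d) hidden weights, a: (m,) output weights,
    X: (n, d) inputs, y: (n,) targets, lam: weight-decay strength.
    """
    hidden = np.maximum(X @ W.T, 0.0)      # (n, m) ReLU features
    residual = hidden @ a - y              # (n,) prediction errors
    data_term = 0.5 * np.sum(residual ** 2)
    decay_term = 0.5 * lam * (np.sum(W ** 2) + np.sum(a ** 2))
    return data_term + decay_term

def activation_pattern(W, X):
    """Boolean (m, n) matrix: entry (j, i) is True when neuron j is active on x_i."""
    return (W @ X.T) > 0

# Small random instance (illustrative sizes only).
rng = np.random.default_rng(0)
n, d, m = 5, 3, 8
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)

loss = regularized_loss(W, a, X, y, lam=0.1)
pattern = activation_pattern(W, X)
```

Rescaling a neuron's hidden weights by a positive factor leaves its row of `pattern` unchanged, which is why regions of constant activation pattern are cones in weight space.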

Paper Structure

This paper contains 31 sections, 15 theorems, 106 equations, 6 figures.

Key Result

Theorem 1

Let $\varepsilon\in(0,1)$. If $m=\Omega\left(\min(n^{d},2^n)\log\left(\frac{n}{\varepsilon}\right)\right)$, then for any $\lambda>0$, in all except at most an $\varepsilon$ fraction of non-empty activation cones $\mathcal{C}^A$ the following hold simultaneously: $\mathcal{C}^A$ contains a global minimum, and $\mathcal{C}^A$ contains no spurious local minima.

Figures (6)

  • Figure 1: Proportion of activation cones containing global minima (blue) and bad local minima (orange) across varying $m$, $n$, and $d$. The vertical dotted line corresponds to the number of non-empty neuron activation patterns, of which there are $4\cdot \sum_{i=0}^{d-1} \binom{n-1}{i}=\mathcal{O}\left(\min(2^n,n^d)\right)$ many.
  • Figure 2: We sample a nonempty activation cone by generating a random network and observing the activation patterns that the neurons of the random network have. The plot shows the proportion of activation cones containing global minima (blue) and bad local minima (orange) across varying $m$, $n$, and $d$, when sampled in this way. The vertical dotted line corresponds to the number of non-empty neuron activation patterns, of which there are $4\cdot \sum_{i=0}^{d-1} \binom{n-1}{i}=\mathcal{O}\left(\min(2^n,n^d)\right)$ many.
  • Figure 3: Proportion of activation cones containing global minima (blue) and local minima (orange) across varying $m$, $n$, and $d$ for orthogonal datasets. The vertical dotted line corresponds to the number of non-empty neuron activation patterns, of which there are $4\cdot \sum_{i=0}^{d-1} \binom{n-1}{i}=\mathcal{O}\left(\min(2^n,n^d)\right)$ many.
  • Figure 4: The square of the Euclidean norm of all network weights after training has finished, starting with different initialization scales $\alpha$ for $d$-dimensional data. The shaded areas correspond to the min/max deviations observed over 5 different runs. For each run, both the dataset and the initial weights are drawn from the distribution discussed in the appendix on extended experimental details. A network consisting only of the teacher neuron has a squared Euclidean norm of $2$. For large initialization scales, training converges to a network of notably smaller size.
  • Figure 5: The difference between the final regularized loss of the trained network and the optimal regularized loss, for different initialization scales $\alpha$ for $d$-dimensional data. The shaded areas correspond to the min/max deviations observed over 5 different runs. For each run, both the dataset and the initial weights are drawn from the distribution discussed in the appendix on extended experimental details.
  • ...and 1 more figure
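
The pattern count quoted in the captions of Figures 1 to 3 can be evaluated directly. The sketch below, in Python, computes $4\cdot\sum_{i=0}^{d-1}\binom{n-1}{i}$ and compares it with the coarse bound $\min(2^n, n^d)$; the function names are hypothetical, and the sample values of $n$ and $d$ are chosen only for illustration.

```python
from math import comb

def num_activation_patterns(n: int, d: int) -> int:
    """Count from the figure captions: 4 * sum_{i=0}^{d-1} C(n-1, i)."""
    return 4 * sum(comb(n - 1, i) for i in range(d))

def coarse_bound(n: int, d: int) -> int:
    """Order-of-magnitude bound min(2^n, n^d) quoted in the captions."""
    return min(2 ** n, n ** d)

# Compare the exact count with the coarse bound for a few sizes.
for n, d in [(5, 2), (10, 3), (20, 5)]:
    print(n, d, num_activation_patterns(n, d), coarse_bound(n, d))
```

Since $\sum_{i=0}^{d-1}\binom{n-1}{i} \le 2^{n-1}$ and $\sum_{i=0}^{d-1}\binom{n-1}{i} \le \sum_{i=0}^{d-1}(n-1)^i \le n^{d-1}$, the count never exceeds $4\min(2^n, n^d)$, which is consistent with the $\mathcal{O}\left(\min(2^n,n^d)\right)$ claim. When $d \ge n$ the sum equals $2^{n-1}$, so the count is exactly $2^{n+1}$.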

Theorems & Definitions (28)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • Lemma 2: Lemma 11 of Karhadkar et al. (2024)
  • proof
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • ...and 18 more