Asymptotic Smoothing of the Lipschitz Loss Landscape in Overparameterized One-Hidden-Layer ReLU Networks

Saveliy Baturin

TL;DR

This work analyzes the loss landscape of overparameterized one-hidden-layer ReLU networks under convex Lipschitz losses with an $\ell_1$-regularized output layer. It proves that any two models at a given loss level can be connected by a low-loss path and shows that the maximal barrier along such paths vanishes as width grows, establishing an asymptotic smoothing of the landscape. The results extend known connectivity from quadratic losses to a broader class of convex Lipschitz losses and provide a rigorous rate $\varepsilon = O(m^{-\zeta})$ for the diminishing energy gap, complemented by empirical evidence on the Moons and Wisconsin Breast Cancer datasets using Dynamic String Sampling. Collectively, the findings offer a theoretical explanation for the ease of optimization in overparameterized regimes and demonstrate the practical impact of width on barrier heights. The work highlights both the fundamental geometry of loss sublevel sets and the practical implications for training wide two-layer networks on real-world tasks, including classification with cross-entropy.
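To make the objects in this summary concrete, the following is a minimal sketch, not the paper's code. It assumes mean absolute error as a stand-in convex Lipschitz loss and measures the barrier along a straight segment between two models, which is a much cruder path than the one Dynamic String Sampling would find. The names `relu_net`, `objective`, and `linear_path_barrier`, the toy data, and the value of `kappa` are all hypothetical.

```python
import numpy as np

def relu_net(X, W1, theta):
    """One-hidden-layer ReLU network: f(x) = theta^T relu(W1 x)."""
    return np.maximum(X @ W1.T, 0.0) @ theta

def objective(X, y, W1, theta, kappa):
    """Mean absolute error (a convex Lipschitz loss, chosen for illustration)
    plus an l1 penalty on the output-layer weights theta."""
    data_term = np.mean(np.abs(relu_net(X, W1, theta) - y))
    return data_term + kappa * np.sum(np.abs(theta))

def linear_path_barrier(X, y, a, b, kappa, steps=51):
    """Max objective along the straight segment from model a to model b,
    minus the larger endpoint value. Non-negative by construction."""
    ts = np.linspace(0.0, 1.0, steps)
    vals = [
        objective(X, y, (1 - t) * a[0] + t * b[0], (1 - t) * a[1] + t * b[1], kappa)
        for t in ts
    ]
    return max(vals) - max(vals[0], vals[-1])

# Toy data and two random models of width m (all values are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = np.sign(X[:, 0])
m = 16
model_a = (rng.normal(size=(m, 2)), rng.normal(size=m) / m)
model_b = (rng.normal(size=(m, 2)), rng.normal(size=m) / m)

gap = linear_path_barrier(X, y, model_a, model_b, kappa=0.01)
print(f"barrier on the straight path: {gap:.4f}")
```

The straight-segment barrier only upper-bounds the barrier of the best connecting path, so it should be read as an illustration of the quantity being bounded rather than a reproduction of the reported measurements.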

Abstract

We study the topology of the loss landscape of one-hidden-layer ReLU networks under overparameterization. On the theory side, we (i) prove that for convex $L$-Lipschitz losses with an $\ell_1$-regularized second layer, every pair of models at the same loss level can be connected by a continuous path within an arbitrarily small loss increase $\varepsilon$ (extending a known result for the quadratic loss); (ii) obtain an asymptotic upper bound on the energy gap $\varepsilon$ between local and global minima that vanishes as the width $m$ grows, implying that the landscape flattens and sublevel sets become connected in the limit. Empirically, on a synthetic Moons dataset and on the Wisconsin Breast Cancer dataset, we measure pairwise energy gaps via Dynamic String Sampling (DSS) and find that wider networks exhibit smaller gaps; in particular, a permutation test on the maximum gap yields $p_{\mathrm{perm}}=0$, indicating a clear reduction in the barrier height.
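The permutation test on the maximum gap can be sketched as follows. This is an assumed form of the test, since the exact statistic and number of permutations are not given here: it compares the maximum gap of a narrow-network group against that of a wide-network group, and the function name `perm_test_max` and the sample gap values are hypothetical.

```python
import numpy as np

def perm_test_max(gaps_narrow, gaps_wide, n_perm=10_000, seed=0):
    """One-sided permutation test on max(gaps_narrow) - max(gaps_wide).

    Returns the observed statistic and the fraction of random relabelings
    whose statistic is at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    gaps_narrow = np.asarray(gaps_narrow, dtype=float)
    gaps_wide = np.asarray(gaps_wide, dtype=float)
    observed = gaps_narrow.max() - gaps_wide.max()

    pooled = np.concatenate([gaps_narrow, gaps_wide])
    n = len(gaps_narrow)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)  # random reassignment of group labels
        stat = shuffled[:n].max() - shuffled[n:].max()
        hits += int(stat >= observed)
    return observed, hits / n_perm

# Hypothetical gap measurements, for illustration only.
narrow = [0.30, 0.42, 0.35, 0.51, 0.38]
wide = [0.02, 0.05, 0.01, 0.04, 0.03]
observed, p_perm = perm_test_max(narrow, wide)
print(f"observed difference of maxima: {observed:.2f}, p_perm = {p_perm:.4f}")
```

With this plain estimator, a reported value of exactly 0 means that none of the sampled relabelings reached the observed statistic, so the resolution of the p-value is limited to $1/n_{\mathrm{perm}}$.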

Paper Structure

This paper contains 17 sections, 3 theorems, 38 equations, 2 tables.

Key Result

Lemma 1

Let $\mathcal{L}(Y,\hat{Y})$ be convex in $\hat{Y}$ and $L$-Lipschitz, and let $\kappa>0$. For any fixed first-layer $W^1$, consider the $\ell_1$-regularized optimization problem over the output weights $\theta$. Then the optimal solution $\theta^*$ satisfies $\|\theta^*\|_1 \le L/\kappa$. In fact, if $\kappa \ge L$, then $\theta^* = 0$ (that is, the minimum is achieved by the zero output weights).
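The second claim of the lemma can be checked with a short derivation under assumptions that are not stated in this summary. Suppose the objective has the form $F(\theta) = \mathcal{L}(Y, Z\theta) + \kappa\|\theta\|_1$, where $Z$ is the matrix of hidden-layer features for the fixed $W^1$, and suppose each column $z_j$ of $Z$ satisfies $\|z_j\| \le 1$ in the norm with respect to which $\mathcal{L}$ is $L$-Lipschitz. Both the form of $F$ and the normalization of $Z$ are assumptions made for this sketch; the paper's own setup may differ.

```latex
\begin{align*}
\|Z\theta\| &\le \sum_j |\theta_j|\,\|z_j\| \le \|\theta\|_1, \\
\mathcal{L}(Y, 0) - \mathcal{L}(Y, Z\theta) &\le L\,\|Z\theta\| \le L\,\|\theta\|_1, \\
F(\theta) - F(0) &= \mathcal{L}(Y, Z\theta) - \mathcal{L}(Y, 0) + \kappa\,\|\theta\|_1
  \ge (\kappa - L)\,\|\theta\|_1 \ge 0 \quad \text{when } \kappa \ge L.
\end{align*}
```

Under these assumptions, $\theta = 0$ minimizes $F$ whenever $\kappa \ge L$, and it is the unique minimizer when $\kappa > L$, since then $F(\theta) > F(0)$ for every $\theta \ne 0$.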

Theorems & Definitions (3)

  • Lemma 1: Control of $\ell_1$-norm
  • Theorem 2: Connectivity for Lipschitz convex loss
  • Theorem 3: Energy gap vanishes with width