Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escape, and Network Embedding

Frank Zhengqing Wu; Berfin Simsek; Francois Gaston Ged

TL;DR

This work provides a rigorous, non-smooth analysis of the loss landscape for one-hidden-layer networks with ReLU-like activations by introducing directional stationary points defined via one-sided directional derivatives. It establishes that stationary points without escape neurons are local minima, while in scalar-output cases, the presence of escape neurons prevents local minimality, clarifying saddle-to-saddle training dynamics under vanishing initialization. The paper also extends network embedding to non-differentiable settings, giving conditions under which stationarity and local minimality are preserved during width expansion and detailing how embedding reshapes the stationary structure. Together, these results advance understanding of training dynamics in overparameterized regimes and offer a principled framework for analyzing non-smooth loss landscapes in shallow networks.
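The one-sided directional derivatives mentioned above can be illustrated with a small numerical sketch. It is not taken from the paper. It assumes a ReLU-like activation of the form $\rho(z) = \alpha^{+} z$ for $z \ge 0$ and $\alpha^{-} z$ for $z < 0$, a squared loss with a factor of $1/2$, and a forward finite difference in place of the exact derivative. All names (`rho`, `loss`, `one_sided_derivative`) are hypothetical.

```python
import numpy as np

def rho(z, a_pos=1.0, a_neg=0.0):
    # Assumed ReLU-like activation: slope a_pos for z >= 0, slope a_neg for z < 0.
    return np.where(z >= 0, a_pos * z, a_neg * z)

def loss(W, H, X, Y):
    # W: (n_hidden, d_in) input weights, H: (d_out, n_hidden) output weights,
    # X: (n_samples, d_in) inputs, Y: (n_samples, d_out) targets.
    Y_hat = rho(X @ W.T) @ H.T
    return 0.5 * np.sum((Y_hat - Y) ** 2)  # assumed 1/2 normalization

def one_sided_derivative(W, H, X, Y, dW, dH, t=1e-7):
    # Forward finite-difference estimate of the one-sided directional
    # derivative of the loss at (W, H) along the direction (dW, dH).
    return (loss(W + t * dW, H + t * dH, X, Y) - loss(W, H, X, Y)) / t

# Toy case: one input x = 1, one hidden neuron with w = 0 (on the kink), h = 1, y = 1.
W = np.array([[0.0]])
H = np.array([[1.0]])
X = np.array([[1.0]])
Y = np.array([[1.0]])
no_move = np.zeros_like(H)

d_plus = one_sided_derivative(W, H, X, Y, np.array([[1.0]]), no_move)
d_minus = one_sided_derivative(W, H, X, Y, np.array([[-1.0]]), no_move)
print(d_plus, d_minus)
```

In this toy case the estimate along $+w$ is close to $-1$ while the estimate along $-w$ is $0$. The two are not negatives of each other, so the loss has no gradient with respect to $w$ at this point even though both one-sided derivatives exist.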

Abstract

In this paper, we study the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained with the empirical squared loss using gradient descent (GD). We identify the stationary points of such networks, which significantly slow down loss decrease during training. To capture such points while accounting for the non-differentiability of the loss, the stationary points that we study are directional stationary points, rather than other notions like Clarke stationary points. We show that, if a stationary point does not contain "escape neurons", which are defined with first-order conditions, it must be a local minimum. Moreover, for the scalar-output case, the presence of an escape neuron guarantees that the stationary point is not a local minimum. Our results refine the description of the saddle-to-saddle training process starting from infinitesimally small (vanishing) initialization for shallow ReLU-like networks: By precluding the saddle escape types that previous works did not rule out, we advance one step closer to a complete picture of the entire dynamics. Moreover, we are also able to fully discuss how network embedding, which is to instantiate a narrower network with a wider network, reshapes the stationary points.
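As a rough illustration of network embedding, the sketch below widens a network by unit replication: one hidden neuron is duplicated and its output weights are split between the two copies. The splitting parameter `beta`, the activation slopes, and all function names are assumptions made for this example rather than the paper's notation.

```python
import numpy as np

def rho(z, a_pos=1.0, a_neg=0.1):
    # Assumed ReLU-like activation: slope a_pos for z >= 0, slope a_neg for z < 0.
    return np.where(z >= 0, a_pos * z, a_neg * z)

def forward(W, H, X):
    # W: (n_hidden, d_in), H: (d_out, n_hidden), X: (n_samples, d_in).
    return rho(X @ W.T) @ H.T

def replicate_unit(W, H, i, beta):
    # Duplicate hidden neuron i: both copies keep the same input weights,
    # and the output weights are split as beta * h_i and (1 - beta) * h_i.
    W_wide = np.vstack([W, W[i:i + 1]])
    H_wide = np.hstack([H, (1.0 - beta) * H[:, i:i + 1]])
    H_wide[:, i] = beta * H[:, i]
    return W_wide, H_wide

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # 3 hidden neurons, 2 input dimensions
H = rng.normal(size=(2, 3))   # 2 output dimensions
X = rng.normal(size=(5, 2))   # 5 inputs

W_wide, H_wide = replicate_unit(W, H, i=1, beta=0.3)
same_output = np.allclose(forward(W, H, X), forward(W_wide, H_wide, X))
print(W_wide.shape, H_wide.shape, same_output)
```

The wider network computes the same function on every input because the two copies contribute $\beta h_i \rho(\mathbf{w}_i \cdot \mathbf{x}) + (1-\beta) h_i \rho(\mathbf{w}_i \cdot \mathbf{x}) = h_i \rho(\mathbf{w}_i \cdot \mathbf{x})$. Whether stationarity or local minimality carries over to the wider network is the question the paper addresses, and this sketch does not check it.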

Paper Structure (69 sections, 11 theorems, 90 equations, 19 figures)

Introduction
Related Work
Stationary Points.
Training Dynamics.
Network Embedding.
Main Contribution
Setup
Network Architecture.
Loss Function.
Stationary Points
Deriving One-sided Directional Derivatives
Identifying Stationary Points
Comparison with Other Notions of Stationarity
Main Results
Properties of Stationary Points
...and 54 more sections

Key Result

Lemma 3.4

We first denote $e_{kj} \coloneqq \hat{y}_{kj} - y_{kj}$. Then, we define the following quantities: We have that

Figures (19)

Figure 1: Network architecture.
Figure 2: A diagram demonstrating the non-differentiability with respect to the input weights. Here, we show a case where input weights and training inputs are $3$-dimensional. There are two training inputs and two input weights. $\mathbf{w}_1$ lies in a plane orthogonal to $\mathbf{x}_2$. The loss is thus locally non-differentiable since it contains the term $\rho(\mathbf{w}_1\cdot\mathbf{x}_2)$. However, we can compute the ODD with respect to $\mathbf{w}_1$ since the loss function with $\mathbf{w}_1$ constrained on either side of the plane or on the plane is a polynomial of $\mathbf{w}_1$.
Figure 3: Evolution of all parameters during the training process from vanishing initialization. (a) The loss curve encounters three plateaus, the last of which corresponds to a local minimum (confirmed by \ref{thm: local minimum general output dimension}). We mark the end of the plateaus with dashed vertical lines. (b) The input weights that are not associated with dead neurons are grouped at several angles. (c) & (d) Grouped neurons have their amplitude increased from near-zero values, which coincides with the saddle escape. The movements of $\Vert \mathbf{w}_{i}\Vert$ and $\vert h_{j_0i}\vert$ are synchronous (see \ref{fact: input output weight same}).
Figure 4: A flow chart for the training process. We preclude other schemes of saddle escape, for example, by splitting aligned neurons, which is possible in \cite{embeddingNEURIPS2019, mild-overparam, pesme2024saddle}. Also, note that besides the indispensable amplitude increase of small living neurons, saddle escape might also be accompanied by amplitude and orientation changes of other neurons.
Figure 5: Unit replication.
...and 14 more figures

Theorems & Definitions (36)

Definition 3.1
Remark 3.2
Remark 3.3
Lemma 3.4
Definition 3.5
Definition 3.6
Definition 3.7
Definition 4.1
Theorem 4.2
Remark 4.3
...and 26 more
