Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escape, and Network Embedding
Frank Zhengqing Wu, Berfin Simsek, Francois Gaston Ged
TL;DR
This work provides a rigorous, non-smooth analysis of the loss landscape for one-hidden-layer networks with ReLU-like activations by introducing directional stationary points defined via one-sided directional derivatives. It establishes that stationary points without escape neurons are local minima, while in scalar-output cases, the presence of escape neurons prevents local minimality, clarifying saddle-to-saddle training dynamics under vanishing initialization. The paper also extends network embedding to non-differentiable settings, giving conditions under which stationarity and local minimality are preserved during width expansion and detailing how embedding reshapes the stationary structure. Together, these results advance understanding of training dynamics in overparameterized regimes and offer a principled framework for analyzing non-smooth loss landscapes in shallow networks.
Abstract
In this paper, we study the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained with the empirical squared loss using gradient descent (GD). We identify the stationary points of such networks, which significantly slow down loss decrease during training. To capture such points while accounting for the non-differentiability of the loss, the stationary points that we study are directional stationary points, rather than other notions like Clarke stationary points. We show that, if a stationary point does not contain "escape neurons", which are defined with first-order conditions, it must be a local minimum. Moreover, for the scalar-output case, the presence of an escape neuron guarantees that the stationary point is not a local minimum. Our results refine the description of the saddle-to-saddle training process starting from infinitesimally small (vanishing) initialization for shallow ReLU-like networks: by precluding the saddle escape types that previous works did not rule out, we advance one step closer to a complete picture of the entire dynamics. Finally, we fully discuss how network embedding, that is, instantiating a narrower network with a wider network, reshapes the stationary points.
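To make the notion of network embedding concrete, the following is a minimal sketch of one simple function-preserving embedding: a hidden neuron is split into two copies that share its input weights and divide its output weight. The abstract does not specify which embedding schemes are analyzed, so this splitting rule, the helper names `forward` and `split_neuron`, and the numerical values are all assumptions chosen for illustration.

```python
# Illustrative sketch only: neuron splitting is one simple function-preserving
# embedding, assumed here; it is not necessarily the scheme analyzed in the paper.

def relu(z):
    return max(z, 0.0)

def forward(hidden, x):
    """Scalar output of a one-hidden-layer ReLU network.

    hidden is a list of (w, a) pairs: w is the input-weight vector of a
    neuron and a is its scalar output weight.
    """
    return sum(a * relu(sum(wi * xi for wi, xi in zip(w, x))) for w, a in hidden)

def split_neuron(hidden, index, beta):
    """Return a network one neuron wider that computes the same function.

    Neuron `index` is replaced by two copies sharing its input weights,
    with output weights beta * a and (1 - beta) * a.
    """
    w, a = hidden[index]
    copies = [(list(w), beta * a), (list(w), (1.0 - beta) * a)]
    return hidden[:index] + copies + hidden[index + 1:]

# Hypothetical narrow network with two hidden neurons and 2-D inputs.
narrow = [([1.0, -2.0], 0.5), ([-0.5, 0.3], -1.5)]
wide = split_neuron(narrow, index=0, beta=0.25)

# Compare the two networks on a few arbitrary inputs.
inputs = [[1.0, 0.2], [-1.0, 0.7], [0.0, 0.0], [3.0, 1.0]]
max_gap = max(abs(forward(narrow, x) - forward(wide, x)) for x in inputs)
print(len(narrow), len(wide), max_gap)
```

Because the two networks agree on every input, they also attain the same empirical squared loss on any dataset. What this sketch does not show is whether a stationary point or local minimum of the narrow network remains one after the split, which is the question about embedding that the paper addresses.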
