Geometry of Critical Sets and Existence of Saddle Branches for Two-layer Neural Networks

Leyang Zhang, Yaoyu Zhang, Tao Luo

TL;DR

The paper develops a geometric framework for two-layer neural networks to analyze the full set of critical points representing a given output function. By introducing the critical embedding and critical reduction operators, it shows that non-global critical points form a finite union of branches with a hierarchical, width-dependent structure and provides precise dimension bounds. It also proves that whenever the output can be represented by a narrower network (minimal width $r$ with $r<m$), the corresponding critical set contains saddle branches, illuminating the role of saddles in training dynamics. These results lay a rigorous foundation for understanding optimization landscapes and gradient flows in overparameterized two-layer networks, with implications for training behavior and network design.

Abstract

This paper presents a comprehensive analysis of critical point sets in two-layer neural networks. To study such complex entities, we introduce the critical embedding operator and the critical reduction operator as our tools. Given a critical point, we use these operators to uncover the whole underlying critical set representing the same output function, which exhibits a hierarchical structure. Furthermore, we prove the existence of saddle branches for any critical set whose output function can be represented by a narrower network. Our results provide a solid foundation for the further study of optimization and training behavior of neural networks.
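
The phrase "output function can be represented by a narrower network" can be made concrete with a small numerical sketch. The code below assumes a two-layer network of the form $f(x) = \sum_k a_k \tanh(w_k^\text{T} x)$ and uses a neuron-splitting construction that is standard in the literature on embedding narrow networks into wide ones. The helper `split_neuron`, the choice of $\tanh$, and the parameter `delta` are illustrative assumptions, not the paper's own definition of the critical embedding operator, which is not reproduced on this page.

```python
import numpy as np

def two_layer(a, W, x):
    """Evaluate f(x) = sum_k a_k * tanh(w_k^T x).

    a: (m,) output weights, W: (m, d) input weights, x: (n, d) inputs.
    """
    return np.tanh(x @ W.T) @ a

def split_neuron(a, W, k, delta):
    """Hypothetical helper: duplicate neuron k and split its output weight
    into delta * a_k and (1 - delta) * a_k, so the width grows by one."""
    a_new = np.concatenate([a, [(1.0 - delta) * a[k]]])
    a_new[k] = delta * a[k]
    W_new = np.vstack([W, W[k:k + 1]])
    return a_new, W_new

rng = np.random.default_rng(0)
d, r, n = 3, 2, 50            # input dimension, narrow width, sample count
a = rng.normal(size=r)
W = rng.normal(size=(r, d))
x = rng.normal(size=(n, d))

# Widen the network from r to r + 1 neurons without changing its output.
a_wide, W_wide = split_neuron(a, W, k=0, delta=0.3)
max_gap = np.max(np.abs(two_layer(a, W, x) - two_layer(a_wide, W_wide, x)))
print("largest output difference:", max_gap)
```

The printed difference should be at the level of floating-point round-off for any value of `delta`, which illustrates why a single narrow network corresponds to a whole family of wider parameter vectors with the same output function. Whether those wider points are critical, and of which type, is what the paper's results address; this sketch does not check that.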

Paper Structure (12 sections, 16 theorems, 34 equations, 3 figures)

This paper contains 12 sections, 16 theorems, 34 equations, 3 figures.

Key Result

Lemma 3.1

Let $\sigma: \mathbb{R} \to \mathbb{R}$ be an analytic non-polynomial function. Then for any $d \in \mathbb{N}$, $m \ge 2$ and any $w_1, ..., w_m \in \mathbb{R}^d$, the neurons $\sigma(w_1^\text{T} x), ..., \sigma(w_m^\text{T} x)$ are linearly independent if and only if every two of them are linearly independent ...
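
Linear independence of neurons can be probed numerically by sampling inputs and computing the rank of the matrix of neuron values. The sketch below assumes $\sigma = \tanh$ and hand-picked weight vectors; the names `neuron_matrix`, `W_generic` and `W_paired` are illustrative. The dependent case relies only on the elementary fact that $\tanh$ is odd, so $w$ and $-w$ give neurons that differ by a sign. It is a sanity check on sampled points, not a proof and not a restatement of the lemma.

```python
import numpy as np

def neuron_matrix(W, x):
    """Column k holds the values tanh(w_k^T x) over the sampled inputs x."""
    return np.tanh(x @ W.T)   # shape (n, m)

rng = np.random.default_rng(0)
d, n = 2, 200
x = rng.normal(size=(n, d))

# Three weight vectors, no two of which are equal up to sign.
W_generic = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Same, but the second vector is the negative of the first:
# tanh(-z) = -tanh(z), so the first two neurons are linearly dependent.
W_paired = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 1.0]])

rank_generic = np.linalg.matrix_rank(neuron_matrix(W_generic, x))
rank_paired = np.linalg.matrix_rank(neuron_matrix(W_paired, x))
print("rank with pairwise independent neurons:", rank_generic)
print("rank with a dependent pair:", rank_paired)
```

A full numerical rank on sampled points certifies independence of the sampled columns, hence of the functions. A deficient rank only suggests dependence, which in the paired case is confirmed by the oddness of $\tanh$.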

Figures (3)

  • Figure 1: Illustration of $\mathcal{C}^{1,l}$ for $0 \le l \le m-1$ defined above. For each $0 \le l \le m-1$, $\mathcal{C}^{1,l}$ is an affine subspace of dimension $(m-l-1) + l = m-1$. The set $\mathcal{C}^{1,0}$ has strict saddles, and points in all the other $\mathcal{C}^{1,l}$ with $1 \le l \le m-1$ are saddles. Moreover, these critical sets are connected to one another, as shown in this figure. See the corresponding lemma (label "ALem for example") and its remark for a proof.
  • Figure 2: Illustration of our method above to show that every point in $\mathcal{C}^{1,l}$ with $l > 1$ is a saddle. Starting from $\theta^* \in \mathcal{C}^{1,l}$, we first perturb it to $\tilde{\theta}$ with the same loss value as that of $\theta^*$. Then, using the fact that $\nabla R(g_m, \tilde{\theta}) \ne 0$, we perturb $\tilde{\theta}$ to a point $\theta$ arbitrarily close to it with $R(g_m, \theta) < R(g_m, \tilde{\theta}) = R(g_m, \theta^*)$.
  • Figure 3: As illustrated by the figure, $\mathcal{C}^{r,0}, \mathcal{C}^{r,1}, ..., \mathcal{C}^{r,m-r}$ are connected to one another, and when $\sigma$ is analytic, for each $1 \le l \le m$ the branch $\mathcal{C}^{r,l}$ is an analytic set of dimension $(m-l-r) + N + (d-1)$ with $N = \dim (\nabla R(g_r, \cdot))^{-1}(0)$. Furthermore, each $\mathcal{C}^{r,l}$ with $1 \le l \le m-r$ consists only of saddles, while the branch $\mathcal{C}^{r,0}$ has strict saddles provided that the hypothesis in the corresponding proposition (label "Prop Embedding saddles") is satisfied.

Theorems & Definitions (38)

  • Lemma 3.1
  • proof
  • Theorem 3.1: branch geometry -- informal
  • Theorem 3.2: existence of saddles -- informal
  • Definition 4.1: permutation action
  • Definition 4.2: stratification of parameter space
  • Definition 4.3: critical embedding operator, summary from EbddPrincipleShort
  • Definition 4.4: critical reduction operator
  • Proposition 4.1: properties of critical embedding and critical reduction operators
  • proof
  • ...and 28 more