Geometry of Critical Sets and Existence of Saddle Branches for Two-layer Neural Networks

Leyang Zhang; Yaoyu Zhang; Tao Luo

TL;DR

The paper develops a geometric framework for two-layer neural networks to analyze the full set of critical points representing a given output function. By introducing the critical embedding and critical reduction operators, it shows that non-global critical points form a finite union of branches with a hierarchical, width-dependent structure and provides precise dimension bounds. It also proves that whenever the output can be represented by a narrower network (minimal width $r$ with $r<m$), the corresponding critical set contains saddle branches, illuminating the role of saddles in training dynamics. These results lay a rigorous foundation for understanding optimization landscapes and gradient flows in overparameterized two-layer networks, with implications for training behavior and network design.

Abstract

This paper presents a comprehensive analysis of critical point sets in two-layer neural networks. To study such complex entities, we introduce two tools: the critical embedding operator and the critical reduction operator. Given a critical point, we use these operators to uncover the whole underlying critical set representing the same output function, which exhibits a hierarchical structure. Furthermore, we prove the existence of saddle branches for any critical set whose output function can be represented by a narrower network. Our results provide a solid foundation for the further study of optimization and training behavior of neural networks.
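
To make the idea of an output-preserving, criticality-preserving embedding concrete, here is a minimal sketch of neuron splitting, one well-known map of this kind. It is an illustration under stated assumptions, not the paper's definition: the tanh activation, the mean squared loss, the random data, and the names `forward`, `grad` and `split_neuron` are all hypothetical choices made for this example. The sketch checks that splitting one neuron leaves the output unchanged and rescales the gradient blocks, so a zero gradient of the narrow network gives a zero gradient of the wide one.

```python
import math
import random


def forward(a, W, x):
    """Two-layer network f(x) = sum_k a_k * tanh(w_k . x)."""
    return sum(ak * math.tanh(sum(wi * xi for wi, xi in zip(wk, x)))
               for ak, wk in zip(a, W))


def grad(a, W, data):
    """Gradient of the loss 0.5 * mean (f(x) - y)^2 with respect to (a, W)."""
    n = len(data)
    ga = [0.0] * len(a)
    gW = [[0.0] * len(W[0]) for _ in W]
    for x, y in data:
        e = forward(a, W, x) - y  # residual on this sample
        for k, (ak, wk) in enumerate(zip(a, W)):
            t = math.tanh(sum(wi * xi for wi, xi in zip(wk, x)))
            ga[k] += e * t / n
            for i, xi in enumerate(x):
                gW[k][i] += e * ak * (1.0 - t * t) * xi / n
    return ga, gW


def split_neuron(a, W, k, delta):
    """Assumed splitting map: neuron k becomes two copies sharing w_k,
    with output weights delta * a_k and (1 - delta) * a_k."""
    a2 = a[:k] + [delta * a[k], (1.0 - delta) * a[k]] + a[k + 1:]
    W2 = W[:k] + [list(W[k]), list(W[k])] + W[k + 1:]
    return a2, W2


# Arbitrary narrow network (width r = 2, input dimension d = 2) and toy data.
random.seed(0)
d, r = 2, 2
a = [random.gauss(0, 1) for _ in range(r)]
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
data = [([random.gauss(0, 1) for _ in range(d)], random.gauss(0, 1))
        for _ in range(20)]

delta = 0.3
a2, W2 = split_neuron(a, W, 0, delta)  # width 2 -> width 3
ga, gW = grad(a, W, data)
ga2, gW2 = grad(a2, W2, data)

# Largest change in the output over the data (should be at rounding level).
max_output_gap = max(abs(forward(a, W, x) - forward(a2, W2, x)) for x, _ in data)
```

In this sketch the gradient blocks of the wide network are the narrow ones up to the factors 1, `delta` and `1 - delta`: `ga2[0]` and `ga2[1]` both equal `ga[0]`, while `gW2[0]` and `gW2[1]` equal `delta * gW[0]` and `(1 - delta) * gW[0]`. Varying `delta` therefore traces out a one-parameter family of wide-network parameters with the same output, which is the kind of degenerate direction that makes critical sets of wider networks higher-dimensional.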

Paper Structure (12 sections, 16 theorems, 34 equations, 3 figures)

Introduction
Related Works
Main Results
Preliminaries
Illustration of Main Results
Criticality Preserving Operators
Geometry and Functional Properties of Critical Sets
Geometry of Critical Sets
Saddle and Saddle Connectivity
Conclusion
Acknowledgement
Appendix

Key Result

Lemma 3.1

Let $\sigma: \mathbb{R} \to \mathbb{R}$ be an analytic non-polynomial function. Then for any $d \in \mathbb{N}$, $m \ge 2$ and any $w_1, \dots, w_m \in \mathbb{R}^d$, the neurons $\sigma(w_1^\text{T} x), \dots, \sigma(w_m^\text{T} x)$ are linearly independent if and only if every two of them are linearly independent.
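
As a small worked example of how a pairwise dependence can arise, take $\sigma = \tanh$, which is analytic, non-polynomial and odd. This choice of activation is an assumption made for illustration and is not taken from the source. If $w_2 = -w_1$, then

$$\sigma(w_2^\text{T} x) = \tanh(-w_1^\text{T} x) = -\tanh(w_1^\text{T} x) = -\sigma(w_1^\text{T} x) \quad \text{for all } x \in \mathbb{R}^d,$$

so $\sigma(w_1^\text{T} x) + \sigma(w_2^\text{T} x) = 0$ and the two neurons are linearly dependent. By the lemma, ruling out such dependent pairs is enough to guarantee that the whole family of $m$ neurons is linearly independent.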

Figures (3)

Figure 1: Illustration of $\mathcal{C}^{1,l}$ for $0 \le l \le m-1$ defined above. For each $0 \le l \le m-1$, $\mathcal{C}^{1,l}$ is an affine subspace of dimension $(m-l-1) + l = m-1$. The set $\mathcal{C}^{1,0}$ has strict saddles, and points in all the other $\mathcal{C}^{1,l}$ with $1 \le l \le m-1$ are saddles. Moreover, these critical sets are connected to one another, as shown in this figure. See Lemma \ref{['ALem for example']} and its remark for a proof.
Figure 2: Illustration of our method above to show every point in $\mathcal{C}^{1,l}$ with $l > 1$ is a saddle. Starting from $\theta^* \in \mathcal{C}^{1,l}$, we first perturb it to $\Tilde{\theta}$ with the same loss value as $\theta^*$; then, using the fact that $\nabla R(g_m, \Tilde{\theta}) \ne 0$, we perturb $\Tilde{\theta}$ to a $\theta$ arbitrarily close to it with $R(g_m, \theta) < R(g_m, \Tilde{\theta}) = R(g_m, \theta^*)$.
Figure 3: As illustrated by the figure, $\mathcal{C}^{r,0}, \mathcal{C}^{r,1}, ..., \mathcal{C}^{r,m-r}$ are connected to one another, and when $\sigma$ for each $1 \le l \le m$ the branch $\mathcal{C}^{r,l}$ is an analytic set of dimension $(m-l-r) + N + (d-1)$ with $N = \dim (\nabla R(g_r, \cdot))^{-1}(0)$. Furthermore, each $\mathcal{C}^{r,l}$ with $1 \le l \le m-r$ consists only of saddles, while the branch $\mathcal{C}^{r,0}$ has strict saddles provided that the hypothesis in \ref{['Prop Embedding saddles']} is satisfied.

Theorems & Definitions (38)

Lemma 3.1
proof
Theorem 3.1: branch geometry -- informal
Theorem 3.2: existence of saddles -- informal
Definition 4.1: permutation action
Definition 4.2: stratification of parameter space
Definition 4.3: critical embedding operator, summary from EbddPrincipleShort
Definition 4.4: critical reduction operator
Proposition 4.1: properties of critical embedding and critical reduction operators
proof
...and 28 more
