Geometry and Local Recovery of Global Minima of Two-layer Neural Networks at Overparameterization

Leyang Zhang; Yaoyu Zhang; Tao Luo

TL;DR

We show how global minima with zero generalization error become geometrically separated from other global minima as the sample size grows, and we characterize the local convergence properties and rate of gradient flow near global minima.

Abstract

Under mild assumptions, we investigate the geometry of the loss landscape for two-layer neural networks in the vicinity of global minima. Utilizing novel techniques, we demonstrate: (i) how global minima with zero generalization error become geometrically separated from other global minima as the sample size grows; and (ii) the local convergence properties and rate of gradient flow dynamics. Our results indicate that two-layer neural networks can be locally recovered in the regime of overparameterization.
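
As a rough illustration of the local recovery setting described above, the following sketch discretizes gradient flow with plain gradient descent for a two-neuron student fitting a one-neuron teacher, starting near a global minimum. The tanh activation, squared loss, scalar inputs, step size, initial point, and all names (`teacher`, `student`, `gradient_descent`) are assumptions made for illustration; none of them are taken from the paper.

```python
import numpy as np

# Hypothetical setup: teacher a_bar * tanh(w_bar * x), student with two neurons.
def teacher(x, a_bar=1.0, w_bar=1.0):
    return a_bar * np.tanh(w_bar * x)

def student(theta, x):
    a1, w1, a2, w2 = theta
    return a1 * np.tanh(w1 * x) + a2 * np.tanh(w2 * x)

def loss(theta, x, y):
    # Empirical squared loss R(theta) = (1 / 2n) * sum_i (g(theta, x_i) - y_i)^2
    r = student(theta, x) - y
    return 0.5 * np.mean(r ** 2)

def grad(theta, x, y):
    # Analytic gradient of the loss with respect to (a1, w1, a2, w2).
    a1, w1, a2, w2 = theta
    r = student(theta, x) - y
    t1, t2 = np.tanh(w1 * x), np.tanh(w2 * x)
    return np.array([
        np.mean(r * t1),
        np.mean(r * a1 * x * (1.0 - t1 ** 2)),
        np.mean(r * t2),
        np.mean(r * a2 * x * (1.0 - t2 ** 2)),
    ])

def gradient_descent(theta0, x, y, lr=0.05, steps=2000):
    # Explicit Euler discretization of the gradient flow d(theta)/dt = -grad R(theta).
    theta = np.array(theta0, dtype=float)
    losses = [loss(theta, x, y)]
    for _ in range(steps):
        theta = theta - lr * grad(theta, x, y)
        losses.append(loss(theta, x, y))
    return theta, losses

x = np.linspace(-2.0, 2.0, 20)   # n = 20 sample inputs (assumed)
y = teacher(x)
# Start close to the zero-loss point (1, 1, 0, 0.5).
theta_final, losses = gradient_descent([0.9, 1.1, 0.1, 0.5], x, y)
```

Plotting `losses` on a log scale is one way to inspect the empirical convergence rate from a given initialization.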

Paper Structure (17 sections, 27 theorems, 70 equations, 8 figures, 2 tables)

Introduction
A Glance at this Paper
Notations and Assumptions
Local Recovery Problem
Main Results
Preparing Lemmas and Propositions
Linear Independence of Neurons
Theory of Real Analytic Functions
Separating Inputs are Almost Everywhere
Geometry of M
Loss Landscape Near M
Dynamics of Gradient Flow Near M
Limiting Set of Gradient Flow
Convergence Rate and Limiting Direction of Gradient Flow
Local Recovery by Gradient Flow
...and 2 more sections

Key Result

Theorem 2.1

Let $\{Q_t\}_{t=1}^N$ be the branches of $Q^*$. Each branch $Q_t$ corresponds to a sample size threshold $N_t \le m(d+1)$ (and if $m > m_0$, we have $N_t < m(d+1)$) such that, when the sample size satisfies $n \ge N_t$, $Q_t$ is "separated" from the imperfect global minima. Moreover, after rearranging the indices of the branches, there is an $N'$ such that whenever $t \le N'$ and $n \ge N_t$, $R$ is not Morse--Bott anywhere at $Q_t$, while for ...
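
To make the bound on $N_t$ concrete, here is a small worked count. It assumes (this is not stated in the excerpt above) that the network has the form $g(\theta, x) = \sum_{k=1}^m a_k \sigma(w_k^\top x)$ with $a_k \in \mathbb{R}$ and $w_k \in \mathbb{R}^d$, so that $m(d+1)$ is the number of parameters, and that $m_0$ denotes the width of the target network.

```latex
\begin{align*}
\theta &= (a_1, w_1, \dots, a_m, w_m) \in \mathbb{R}^{m(d+1)}, \\
m = 2,\ d = 1 &\implies m(d+1) = 4 \implies N_t \le 4, \\
m = 2 > m_0 = 1 &\implies N_t < 4, \ \text{i.e. } N_t \le 3.
\end{align*}
```

Under these assumptions, a two-neuron model with scalar input has four parameters, which matches the four coordinates of points such as $(\bar{a}, \bar{w}, 0, \bar{w})$ in the figure captions.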

Figures (8)

Figure 1: Overview of theoretical results and their interconnections. The main parts are in dark pink boxes, the basic theories are in green boxes and the other results are in yellow boxes.
Figure 2: Illustration of $Q^*$. The closures of the branches $Q_1, Q_2, Q_3$ are all affine subspaces in the parameter space. Moreover, $Q_1, Q_2$ intersect at $(\bar{a}, \bar{w}, 0, \bar{w})$ and $Q_1, Q_3$ intersect at $(0, \bar{w}, \bar{a}, \bar{w})$.
Figure 3: Illustration of the example of a two-neuron model fitting a one-neuron network. As shown in part (a) of the example, $Q^*$ consists of three sets whose closures are one-dimensional affine subspaces. By (b), the loss $R$ is not Morse--Bott near any point in $Q_1$, whence by (c) a gradient flow with limit in $Q_1$ ($\theta_1^*$ in the figure) is in general "biased towards" $\ker\,\mathrm{Hess}\, R(\theta_1^*)$. On the other hand, $R$ is Morse--Bott a.e. at $Q_2$ and $Q_3$, whence a gradient flow with limit in $Q_2 \cup Q_3$ ($\theta_2^*, \theta_3^*$ in the figure) in general converges at a linear rate. Finally, note that $Q_{12} = (\bar{a}, \bar{w}, 0, \bar{w})$ and $Q_{13} = (0, \bar{w}, \bar{a}, \bar{w})$ are the points of intersection $\overline{Q_1} \cap \overline{Q_2}$ and $\overline{Q_1} \cap \overline{Q_3}$, respectively.
Figure 4: Intersection of different $Q_P^r$'s, viewed from "$w$-space". The left panel shows the intersection of $Q_{\{0,1,2,3\}}^3$ (green surface), $Q_{\{0,1,3\}}^2$ (black line) and $Q_{\{0,3\}}^1$ (red dot). The right panel shows the intersection of the $Q_P^2$'s. Here $Q^2$ consists of three (geometrically) identical branches (green surfaces) with the same $r$ but different permutations. Their intersections are the blue lines and the red dot, which are also identical up to permutation.
Figure 5: Intersection of branches $Q_{\{0,1,2,3\}}^3$ (green surface), $Q_{\{0,1,3\}}^2$ (tilted black line) and $Q_{\{0,3\}}^1$ (red dot), viewed from "$a$-space".
...and 3 more figures
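
The Morse--Bott distinction in the Figure 3 caption can be probed numerically with a small sketch. It assumes a tanh activation, scalar inputs, and a squared loss, so that at a zero-loss point the Hessian of $R$ reduces to the Gauss-Newton matrix $\frac{1}{n} J^\top J$, where $J$ is the Jacobian of the model outputs with respect to the parameters. It also assumes, inferring from the intersection points in the Figure 2 caption, that $Q_1$ contains points $(a_1, \bar{w}, a_2, \bar{w})$ with $a_1 + a_2 = \bar{a}$ and $Q_2$ contains points $(\bar{a}, \bar{w}, 0, w)$. The function names and sample points are hypothetical. Since each branch is one-dimensional in a four-dimensional parameter space, a Morse--Bott point should have Hessian rank 3.

```python
import numpy as np

def model_jacobian(theta, x):
    # Row i is the gradient of g(theta, x_i) = a1*tanh(w1*x_i) + a2*tanh(w2*x_i)
    # with respect to (a1, w1, a2, w2). The tanh activation is an assumption.
    a1, w1, a2, w2 = theta
    t1, t2 = np.tanh(w1 * x), np.tanh(w2 * x)
    return np.stack(
        [t1, a1 * x * (1.0 - t1 ** 2), t2, a2 * x * (1.0 - t2 ** 2)], axis=1
    )

def hessian_rank_at_zero_loss(theta, x, tol=1e-8):
    # At a zero-residual point of the squared loss, Hess R = J^T J / n.
    J = model_jacobian(theta, x)
    H = J.T @ J / len(x)
    return int(np.linalg.matrix_rank(H, tol=tol))

x = np.linspace(-2.0, 2.0, 20)        # assumed sample inputs
a_bar, w_bar = 1.0, 1.0               # assumed teacher parameters

theta_q1 = (0.3, w_bar, 0.7, w_bar)   # a1 + a2 = a_bar, both weights equal w_bar
theta_q2 = (a_bar, w_bar, 0.0, 0.5)   # second neuron switched off, free weight

rank_q1 = hessian_rank_at_zero_loss(theta_q1, x)   # 2: below the codimension 3
rank_q2 = hessian_rank_at_zero_loss(theta_q2, x)   # 3: equals the codimension
```

In this toy setup the rank deficit at the $Q_1$ point comes from the two neurons sharing the same weight, which makes two pairs of Jacobian columns linearly dependent.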

Theorems & Definitions (43)

Remark 1
Theorem 2.1: separation of branches in $Q^*$
Theorem 2.2: gradient flow near global minima
Lemma 2: characterization of $\varepsilon$-polynomial
Corollary 3.1: linear independence of neurons
Remark 3
Remark 4
Proposition 3.1
Remark 5: proof techniques
Corollary 3.2: separating inputs
...and 33 more
