
Geometry and Local Recovery of Global Minima of Two-layer Neural Networks at Overparameterization

Leyang Zhang, Yaoyu Zhang, Tao Luo

TL;DR

The paper shows how global minima with zero generalization error become geometrically separated from other global minima as the sample size grows, and characterizes the local convergence properties and rate of gradient flow dynamics.

Abstract

Under mild assumptions, we investigate the geometry of the loss landscape for two-layer neural networks in the vicinity of global minima. Utilizing novel techniques, we demonstrate: (i) how global minima with zero generalization error become geometrically separated from other global minima as the sample size grows; and (ii) the local convergence properties and rate of gradient flow dynamics. Our results indicate that two-layer neural networks can be locally recovered in the regime of overparameterization.

Paper Structure (17 sections, 27 theorems, 70 equations, 8 figures, 2 tables)

Key Result

Theorem 2.1

Let $\{Q_t\}_{t=1}^N$ be the branches of $Q^*$. Each branch $Q_t$ corresponds to a sample size threshold $N_t \le m(d+1)$ (and if $m > m_0$, we have $N_t < m(d+1)$), such that when the sample size satisfies $n \ge N_t$, $Q_t$ is "separated" from the imperfect global minima. Moreover, after rearranging the indices of the branches, whenever $t \le N'$ and $n \ge N_t$, $R$ is not Morse--Bott anywhere at $Q_t$, while for ...

Figures (8)

  • Figure 1: Overview of theoretical results and their interconnections. The main parts are in dark pink boxes, the basic theories are in green boxes and the other results are in yellow boxes.
  • Figure 2: Illustration of $Q^*$. The closures of the branches $Q_1, Q_2, Q_3$ are all affine subspaces in the parameter space. Moreover, $Q_1$ and $Q_2$ intersect at $(\bar{a}, \bar{w}, 0, \bar{w})$, and $Q_1$ and $Q_3$ intersect at $(0, \bar{w}, \bar{a}, \bar{w})$.
  • Figure 3: Illustration of the example of a two-neuron model fitting a one-neuron network. As shown in part (a) of the example, $Q^*$ consists of three sets whose closures are one-dimensional affine subspaces. By (b), the loss $R$ is not Morse--Bott near any point in $Q_1$, whence by (c) a gradient flow with limit in $Q_1$ ($\theta_1^*$ in the figure) is in general "biased towards" $\ker\,\mathrm{Hess}\, R(\theta_1^*)$. On the other hand, $R$ is Morse--Bott a.e. at $Q_2$ and $Q_3$, whence a gradient flow with limit in $Q_2 \cup Q_3$ ($\theta_2^*, \theta_3^*$ in the figure) in general converges at a linear rate. Finally, note that $Q_{12} = (\bar{a}, \bar{w}, 0, \bar{w})$ and $Q_{13} = (0, \bar{w}, \bar{a}, \bar{w})$ are the points of intersection $\overline{Q_1} \cap \overline{Q_2}$ and $\overline{Q_1} \cap \overline{Q_3}$, respectively.
  • Figure 4: Intersection of different $Q_P^r$'s, viewed from "$w$-space". The left panel shows the intersection of $Q_{\{0,1,2,3\}}^3$ (green surface), $Q_{\{0,1,3\}}^2$ (black line) and $Q_{\{0,3\}}^1$ (red dot). The right panel shows the intersection of the $Q_P^2$'s. Clearly $Q^2$ consists of three (geometrically) identical branches (green surfaces) with the same $r$ but different permutations. Their intersections are the blue lines and the red dot, which are also identical up to permutation.
  • Figure 5: Intersection of branches $Q_{\{0,1,2,3\}}^3$ (green surface), $Q_{\{0,1,3\}}^2$ (tilted black line) and $Q_{\{0,3\}}^1$ (red dot), viewed from "$a$-space".
  • ...and 3 more figures
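
The captions of Figures 2 and 3 describe a two-neuron model fitting a one-neuron network, with three one-dimensional branches meeting at $(\bar{a}, \bar{w}, 0, \bar{w})$ and $(0, \bar{w}, \bar{a}, \bar{w})$. The sketch below is a hedged illustration only: it assumes scalar inputs, a tanh activation, a mean-squared-error loss, and the parameter ordering $(a_1, w_1, a_2, w_2)$, none of which is stated on this page. The three lines `q1`, `q2`, `q3` are one parameterization consistent with the intersection points quoted in the captions, not necessarily the paper's definition of $Q_1, Q_2, Q_3$. The code checks numerically that the loss vanishes along each line.

```python
import numpy as np

# Hypothetical setup (not from the paper): scalar inputs, tanh activation,
# target a_bar * tanh(w_bar * x), parameters ordered as (a1, w1, a2, w2).
A_BAR, W_BAR = 1.5, 0.7
X = np.linspace(-2.0, 2.0, 9)      # sample inputs
Y = A_BAR * np.tanh(W_BAR * X)     # noiseless targets


def empirical_loss(theta):
    """Mean squared error of the two-neuron model on (X, Y)."""
    a1, w1, a2, w2 = theta
    pred = a1 * np.tanh(w1 * X) + a2 * np.tanh(w2 * X)
    return float(np.mean((pred - Y) ** 2))


def q1(s):
    """Both neurons sit at w_bar; output weights sum to a_bar."""
    return (s, W_BAR, A_BAR - s, W_BAR)


def q2(s):
    """First neuron equals the target; second has zero output weight."""
    return (A_BAR, W_BAR, 0.0, s)


def q3(s):
    """Second neuron equals the target; first has zero output weight."""
    return (0.0, s, A_BAR, W_BAR)


if __name__ == "__main__":
    for name, line in (("Q1", q1), ("Q2", q2), ("Q3", q3)):
        worst = max(empirical_loss(line(s)) for s in np.linspace(-3, 3, 13))
        print(name, "max loss along line:", worst)
    # A point off the three lines, for comparison.
    print("off-line loss:", empirical_loss((1.0, 0.2, 0.3, 1.1)))
```

In this parameterization `q1(A_BAR)` coincides with `q2(W_BAR)` and `q1(0.0)` coincides with `q3(W_BAR)`, matching the two intersection points given in the captions.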

Theorems & Definitions (43)

  • Remark 1
  • Theorem 2.1: separation of branches in $Q^*$
  • Theorem 2.2: gradient flow near global minima
  • Lemma 2: characterization of $\varepsilon$-polynomial
  • Corollary 3.1: linear independence of neurons
  • Remark 3
  • Remark 4
  • Proposition 3.1
  • Remark 5: proof techniques
  • Corollary 3.2: separating inputs
  • ...and 33 more