Towards Understanding Epoch-wise Double descent in Two-layer Linear Neural Networks

Amanda Olmin; Fredrik Lindsten

TL;DR

We derive a gradient flow for the two-layer linear model that bridges the learning dynamics of standard linear regression and those of the two-layer linear diagonal network with quadratic weights. We also identify additional factors behind epoch-wise double descent that emerge with the extra model layer.

Abstract

Epoch-wise double descent is the phenomenon where generalisation performance improves beyond the point of overfitting, resulting in a generalisation curve that exhibits two descents over the course of learning. Understanding the mechanisms driving this behaviour is crucial not only for understanding the generalisation behaviour of machine learning models in general, but also for employing conventional selection methods, such as early stopping to mitigate overfitting. While we ultimately want to draw conclusions about more complex models, such as deep neural networks, the majority of theoretical results on the underlying cause of epoch-wise double descent are based on simple models, such as standard linear regression. In this paper, to take a step towards more complex models in theoretical analysis, we study epoch-wise double descent in two-layer linear neural networks. First, we derive a gradient flow for the two-layer linear model that bridges the learning dynamics of standard linear regression and those of the two-layer linear diagonal network with quadratic weights. Second, we identify additional factors behind epoch-wise double descent that emerge with the extra model layer, by deriving necessary conditions for the generalisation error to follow a double descent pattern. While epoch-wise double descent in linear regression has been attributed to differences in input variance, in the two-layer model the singular values of the input-output covariance matrix also play an important role. This raises further questions about as yet unidentified factors behind epoch-wise double descent in truly deep models.
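To make the object of study concrete, the following is a minimal sketch of how train and test error curves can be recorded per epoch for a two-layer linear network trained with full-batch gradient descent. Everything specific here is an assumption for illustration and is not taken from the paper: the teacher vector `w_true`, the noise level, the dimensions, the initialisation scale, and the step size. The toy setup is not claimed to exhibit double descent; it only shows which curve one would inspect for a second descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes and teacher (assumptions, not from the paper).
d, h = 5, 5                  # input dimension, hidden width
n_train, n_test = 20, 1000
noise_std = 0.5
w_true = np.ones(d)


def make_data(n):
    """Draw Gaussian inputs and noisy linear targets from the assumed teacher."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + noise_std * rng.normal(size=n)
    return X, y


X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Two-layer linear network f(x) = w2^T W1 x with a small random initialisation.
W1 = 0.01 * rng.normal(size=(h, d))
w2 = 0.01 * rng.normal(size=h)

lr, n_epochs = 0.02, 3000
train_err, test_err = [], []
for epoch in range(n_epochs):
    # Record errors before the update so index 0 is the error at initialisation.
    resid = X_train @ W1.T @ w2 - y_train
    train_err.append(0.5 * np.mean(resid ** 2))
    test_err.append(0.5 * np.mean((X_test @ W1.T @ w2 - y_test) ** 2))

    # Gradients of the mean squared error (with factor 1/2) for both layers.
    back = X_train.T @ resid / n_train      # shape (d,)
    grad_W1 = np.outer(w2, back)            # shape (h, d)
    grad_w2 = W1 @ back                     # shape (h,)
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2
```

Plotting `test_err` against the epoch index gives the generalisation curve. The paper's analysis concerns when such a curve can descend, rise, and descend again.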

Paper Structure (25 sections, 7 theorems, 162 equations, 1 figure)

Introduction
Preliminaries
Theory
Decoupled dynamics of the two-layer linear network
Epoch-wise double descent in the two-layer model
Analysis of error curves
Necessary condition for epoch-wise double descent
Discussion
Incremental learning and epoch-wise double descent
Deeper models
The effect of coupling
Theoretical derivations
Decoupled dynamics of two-layer linear networks
Deriving the bridged dynamics
Proof of Proposition 1
...and 10 more sections

Key Result

Proposition 1

Consider $z_i(t)$ initialised at $z_i(0) < \sigma_i / \lambda_i$ and assume $\lambda_i > 0$. If $\gamma_i \neq 0$, the solution to the relaxed two-layer dynamics (eq:two_layer_dynamics_z_relaxed) is [...] with [...]. Moreover, [...], and the weight $z_i(t)$ converges to the point $z^*_i = \sigma_i / \lambda_i$ at a rate $\mathcal{O}(e^{-\sqrt{ \gamma_i^2 \lambda_i^2 + 4\eta^2 \sigma_i^2 } t})$. If instead $\gamma_i = 0$, the above (eq:z_d
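The fixed point and rate in Proposition 1 can be checked numerically. The sketch below is illustrative only: the referenced equation is not reproduced in this text, so the code assumes the scalar ODE $\dot z = \sqrt{\gamma^2 + 4\eta^2 z^2}\,(\sigma - \lambda z)$. This is one form that is consistent with the stated fixed point $\sigma/\lambda$ and the stated rate, and it reduces to one-layer dynamics for $\eta = 0$, but it may differ from the paper's exact equation. The function name `simulate_z`, the forward-Euler scheme, and the step size are likewise assumptions; the parameter values follow the defaults listed in the Figure 1 caption.

```python
import math


def simulate_z(z0, sigma, lam, gamma, eta, t_end, dt=0.1):
    """Forward-Euler integration of the ASSUMED relaxed dynamics
    dz/dt = sqrt(gamma^2 + 4 eta^2 z^2) * (sigma - lam * z).
    Returns the list of z values at times 0, dt, 2*dt, ..., t_end."""
    z = z0
    traj = [z]
    for _ in range(round(t_end / dt)):
        speed = math.sqrt(gamma ** 2 + 4 * eta ** 2 * z ** 2)
        z += dt * speed * (sigma - lam * z)
        traj.append(z)
    return traj


# Defaults from the Figure 1 caption, "bridged" scenario.
sigma, lam, z0 = 2.5, 1.0, 0.01
gamma, eta = 0.0025, 0.0025
dt = 0.1

traj = simulate_z(z0, sigma, lam, gamma, eta, t_end=2000.0, dt=dt)

z_star = sigma / lam
# Rate stated in Proposition 1.
theory_rate = math.sqrt(gamma ** 2 * lam ** 2 + 4 * eta ** 2 * sigma ** 2)

# Estimate the late-time exponential decay rate of the gap z_star - z(t)
# from two time points well after the initial transient.
i1, i2 = round(1000 / dt), round(1500 / dt)
est_rate = math.log((z_star - traj[i1]) / (z_star - traj[i2])) / ((i2 - i1) * dt)

print(f"final z = {traj[-1]:.6f}, fixed point = {z_star:.6f}")
print(f"estimated rate = {est_rate:.6f}, stated rate = {theory_rate:.6f}")
```

Under this assumed ODE, setting `eta = 0` gives linear one-layer dynamics with rate $\gamma\lambda$, while setting `gamma = 0` gives a logistic-type equation with rate $2\eta\sigma$ near the fixed point. Both agree with the square-root expression in the proposition.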

Figures (1)

Figure 1: Examples of double descent with $|S_{\mathcal{A}}|=10$ active weights, where the active weights are divided into two sets: one set evolving as $z_i(t)$ and the other as $z_j(t)$. We let $\gamma_i=\gamma_j=\gamma$ and consider three scenarios with different values of the parameters $\gamma, \eta$. Left: one-layer dynamics with $\gamma=0.005, \eta=0$. Middle: bridged dynamics with $\gamma=0.0025, \eta=0.0025$. Right: balanced dynamics with $\gamma=0, \eta=0.005$. By default, the remaining parameters are set to $\lambda_i = \lambda_j = 1.0$, $\sigma_i=\sigma_j = 2.5$, $\rho_i=0.5$, $\rho_j=0.8$ and $z_i(0)=z_j(0)= 0.01$. Top: Varying $\lambda_i$. The weight $z_i(t)$ has multiplicity $9$, while $z_j(t)$ has multiplicity $1$. We observe double descent for large $\lambda_i$ in all three scenarios, but double descent seems to appear at a smaller $\lambda_i$ when $\gamma$ is larger. Bottom: Varying $\sigma_i$. The weight $z_i(t)$ has multiplicity $1$, while $z_j(t)$ has multiplicity $9$. We observe double descent for large $\sigma_i$ only in the scenarios where $\eta > 0$.

Theorems & Definitions (7)

Proposition 1
Lemma 1
Proposition 2
Lemma 2
Lemma 3
Proposition 3
Corollary 1
