Towards Understanding Epoch-wise Double descent in Two-layer Linear Neural Networks

Amanda Olmin, Fredrik Lindsten

TL;DR

We derive a gradient flow for the linear two-layer model that bridges the learning dynamics of the standard linear regression model and the linear two-layer diagonal network with quadratic weights, and we identify additional factors of epoch-wise double descent that emerge with the extra model layer.

Abstract

Epoch-wise double descent is the phenomenon where generalisation performance improves beyond the point of overfitting, so that the generalisation curve exhibits two descents over the course of learning. Understanding the mechanisms driving this behaviour is crucial not only for understanding the generalisation behaviour of machine learning models in general, but also for employing conventional selection methods, such as early stopping to mitigate overfitting. While we ultimately want to draw conclusions about more complex models, such as deep neural networks, the majority of theoretical results on the underlying cause of epoch-wise double descent are based on simple models, such as standard linear regression. In this paper, to take a step towards more complex models in theoretical analysis, we study epoch-wise double descent in two-layer linear neural networks. First, we derive a gradient flow for the linear two-layer model that bridges the learning dynamics of the standard linear regression model and the linear two-layer diagonal network with quadratic weights. Second, we identify additional factors of epoch-wise double descent emerging with the extra model layer by deriving necessary conditions for the generalisation error to follow a double descent pattern. While epoch-wise double descent in linear regression has been attributed to differences in input variance, in the two-layer model the singular values of the input-output covariance matrix also play an important role. This raises further questions regarding unidentified factors of epoch-wise double descent for truly deep models.

Paper Structure (25 sections, 7 theorems, 162 equations, 1 figure)


Key Result

Proposition 1

Consider $z_i(t)$ initialised at $z_i(0) < \sigma_i / \lambda_i$ and assume $\lambda_i > 0$. If $\gamma_i \neq 0$, the solution to eq:two_layer_dynamics_z_relaxed is […] with […]. Moreover, […], and the weight $z_i(t)$ converges to the point $z^*_i = \sigma_i / \lambda_i$ at a rate $\mathcal{O}(e^{-\sqrt{ \gamma_i^2 \lambda_i^2 + 4\eta^2 \sigma_i^2 } t})$. If instead $\gamma_i = 0$, the above (eq:z_d
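The fixed point and rate stated in Proposition 1 can be checked numerically with a small sketch. The relaxed dynamics themselves are not reproduced in this section, so the ODE used below, $\dot{z}_i = \sqrt{\gamma_i^2 + 4\eta^2 z_i^2}\,(\sigma_i - \lambda_i z_i)$, is an assumption: it was chosen only because it has the stated fixed point $\sigma_i/\lambda_i$ and linearises to the stated rate, and it may differ from the paper's equation. The function names, the RK4 integrator, and the time windows are likewise illustrative choices.

```python
import math


def relaxed_rhs(z, gamma, eta, sigma, lam):
    # ASSUMED right-hand side (not quoted from the paper): picked so that
    # z* = sigma / lam is a fixed point and the linearised decay rate there
    # equals sqrt(gamma^2 lam^2 + 4 eta^2 sigma^2).
    return math.sqrt(gamma ** 2 + 4.0 * eta ** 2 * z ** 2) * (sigma - lam * z)


def integrate(z0, gamma, eta, sigma, lam, t_end, dt=0.01):
    """Integrate the assumed ODE with classical RK4; returns z at every step."""
    n_steps = int(round(t_end / dt))
    z = z0
    traj = [z]
    for _ in range(n_steps):
        k1 = relaxed_rhs(z, gamma, eta, sigma, lam)
        k2 = relaxed_rhs(z + 0.5 * dt * k1, gamma, eta, sigma, lam)
        k3 = relaxed_rhs(z + 0.5 * dt * k2, gamma, eta, sigma, lam)
        k4 = relaxed_rhs(z + dt * k3, gamma, eta, sigma, lam)
        z += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        traj.append(z)
    return traj


def stated_rate(gamma, eta, sigma, lam):
    # Exponent of the convergence rate as stated in Proposition 1.
    return math.sqrt(gamma ** 2 * lam ** 2 + 4.0 * eta ** 2 * sigma ** 2)


def empirical_rate(traj, dt, t1, t2, z_star):
    # Slope of -log(z* - z(t)) between two late times t1 < t2.
    gap1 = z_star - traj[int(round(t1 / dt))]
    gap2 = z_star - traj[int(round(t2 / dt))]
    return (math.log(gap1) - math.log(gap2)) / (t2 - t1)


# Parameters of the "bridged" setting in Figure 1; time horizon is arbitrary.
gamma, eta, sigma, lam, z0 = 0.0025, 0.0025, 2.5, 1.0, 0.01
dt = 0.1
traj = integrate(z0, gamma, eta, sigma, lam, t_end=1500.0, dt=dt)
print("final z:", traj[-1], "target:", sigma / lam)
print("empirical rate:", empirical_rate(traj, dt, 1000.0, 1500.0, sigma / lam))
print("stated rate:   ", stated_rate(gamma, eta, sigma, lam))
```

Under this assumed ODE, setting $\eta = 0$ gives the linear decay $\gamma_i(\sigma_i - \lambda_i z_i)$ and setting $\gamma_i = 0$ gives $2\eta z_i(\sigma_i - \lambda_i z_i)$, which is why it is a convenient stand-in for dynamics that interpolate between a one-layer and a balanced two-layer regime.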

Figures (1)

  • Figure 1: Examples of double descent with $|S_{\mathcal{A}}|=10$ active weights, where the active weights are divided into two sets: one set evolving as $z_i(t)$ and the other as $z_j(t)$. We let $\gamma_i=\gamma_j=\gamma$ and consider three scenarios with different values of the parameters $\gamma, \eta$. Left: one-layer dynamics with $\gamma=0.005, \eta=0$. Middle: bridged dynamics with $\gamma=0.0025, \eta=0.0025$. Right: balanced dynamics with $\gamma=0, \eta=0.005$. By default, the remaining parameters are set to $\lambda_i = \lambda_j = 1.0$, $\sigma_i=\sigma_j = 2.5$, $\rho_i=0.5$, $\rho_j=0.8$ and $z_i(0)=z_j(0)= 0.01$. Top: Changing $\lambda_i$. The weight $z_i(t)$ has multiplicity $9$, while $z_j(t)$ has multiplicity $1$. We observe double descent for large $\lambda_i$ in all three scenarios, but double descent seems to appear for a smaller $\lambda_i$ when $\gamma$ is larger. Bottom: Changing $\sigma_i$. The weight $z_i(t)$ has multiplicity $1$, while $z_j(t)$ has multiplicity $9$. We observe double descent for large $\sigma_i$ only in the scenarios where $\eta > 0$.
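
One possible way to read the bottom row of Figure 1 is through the convergence rate in Proposition 1, $\sqrt{\gamma^2 \lambda^2 + 4\eta^2 \sigma^2}$: when $\eta = 0$ the rate does not depend on $\sigma$, so changing $\sigma_i$ cannot separate the learning speeds of the two weight groups. The sketch below evaluates that rate for the three $(\gamma, \eta)$ settings in the caption. The sweep values for $\sigma_i$ are hypothetical (the caption does not list them), and a rate gap is only one ingredient; the sketch does not compute the generalisation error or test the paper's necessary conditions.

```python
import math


def asymptotic_rate(gamma, eta, sigma, lam=1.0):
    # Exponent of the convergence rate as stated in Proposition 1.
    return math.sqrt(gamma ** 2 * lam ** 2 + 4.0 * eta ** 2 * sigma ** 2)


# (gamma, eta) for the three panels described in the Figure 1 caption.
scenarios = {
    "one-layer": (0.005, 0.0),
    "bridged": (0.0025, 0.0025),
    "balanced": (0.0, 0.005),
}

sigma_j = 2.5                      # default value from the caption
sigma_i_values = [2.5, 5.0, 10.0]  # HYPOTHETICAL sweep, not from the paper


def rate_ratios(gamma, eta):
    """Ratio rate_i / rate_j for each sigma_i in the sweep (lambda = 1)."""
    rate_j = asymptotic_rate(gamma, eta, sigma_j)
    return [asymptotic_rate(gamma, eta, s) / rate_j for s in sigma_i_values]


for name, (gamma, eta) in scenarios.items():
    ratios = ", ".join(f"{r:.3f}" for r in rate_ratios(gamma, eta))
    print(f"{name:>9}: rate_i / rate_j = {ratios}")
```

In the one-layer setting the ratio stays at 1 for every $\sigma_i$, while in the bridged and balanced settings it grows with $\sigma_i$, which is consistent with the caption's observation that double descent for large $\sigma_i$ appears only when $\eta > 0$.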

Theorems & Definitions (7)

  • Proposition 1
  • Lemma 1
  • Proposition 2
  • Lemma 2
  • Lemma 3
  • Proposition 3
  • Corollary 1