Lecture Notes on Linear Neural Networks: A Tale of Optimization and Generalization in Deep Learning

Nadav Cohen; Noam Razin

TL;DR

These notes present a theory (developed by NC, NR and collaborators) of linear neural networks, a fundamental model in the study of optimization and generalization in deep learning.

Abstract

These notes are based on a lecture delivered by NC in March 2021, as part of an advanced course at Princeton University on the mathematical understanding of deep learning. They present a theory (developed by NC, NR and collaborators) of linear neural networks -- a fundamental model in the study of optimization and generalization in deep learning. Practical applications born from the presented theory are also discussed. The theory is based on mathematical tools that are dynamical in nature. It showcases the potential of such tools to push the envelope of our understanding of optimization and generalization in deep learning. The text assumes familiarity with the basics of statistical learning theory. Exercises (without solutions) are included.

Paper Structure (12 sections, 11 theorems, 58 equations, 3 figures)

Introduction
Dynamical Analysis
Optimization
Convergence Guarantee
Implicit Acceleration by Overparameterization
Generalization
Matrix Sensing
Implicit Regularization
Greedy Low Rank Learning
Implicit Compression by Overparameterization
Extension: Arithmetic Neural Networks
Conclusion

Key Result

Proposition 1

If $\ell ( \cdot )$ does not attain its global minimum at the origin then $\phi ( \cdot )$ is non-convex.
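A minimal numerical illustration of the proposition, assuming (this summary does not spell it out) that $\phi$ denotes the overparameterized objective obtained by evaluating $\ell$ at the product of the network's weight matrices, with at least two layers. The scalar loss `loss`, the depth of two, and the two chosen points are hypothetical choices made only for illustration.

```python
def loss(w):
    """Convex scalar loss whose global minimum is at w = 1, not at the origin."""
    return (w - 1.0) ** 2


def phi(w1, w2):
    """Overparameterized objective of a depth-2 scalar linear network (assumed form)."""
    return loss(w2 * w1)


# Two global minimizers of phi: both have end-to-end weight w2 * w1 = 1.
a = (1.0, 1.0)
b = (-1.0, -1.0)

# Their midpoint is the origin, where the end-to-end weight is 0.
mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

# Convexity would require phi(mid) <= (phi(a) + phi(b)) / 2.
lhs = phi(*mid)
rhs = (phi(*a) + phi(*b)) / 2
convexity_violated = lhs > rhs
print(lhs, rhs, convexity_violated)  # prints: 1.0 0.0 True
```

The construction relies on a sign symmetry: negating both weights leaves the end-to-end weight unchanged, so the two minimizers have the origin as their midpoint, where the loss is strictly larger.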

Figures (3)

Figure 1: Empirical demonstrations of implicit acceleration by overparameterization, i.e., by replacing linear transformations with linear neural networks. The left plot compares gradient descent applied to the $\ell_4$ loss of a linear regression task ("linear") against its application to the overparameterized objectives induced by two and three layer linear neural networks ("2 layer LNN" and "3 layer LNN," respectively). The right plot compares stochastic gradient descent (with momentum) optimizing a non-linear convolutional neural network ("original") against the same algorithm optimizing a model obtained by replacing each dense linear transformation with a two layer linear neural network ("overparameterized"). For details see arora2018optimization, from which the results are taken.
Figure 2: Experiment comparing, on matrix completion (a special case of matrix sensing) tasks, the global minimizer of lowest nuclear norm ("min nuclear") against solutions produced by two and three layer linear neural networks ("2 layer LNN" and "3 layer LNN," respectively). Each task entails a different number of observations (measurements), taken from a low rank ground truth matrix. The left and right plots respectively display reconstruction errors (distances from the ground truth) and nuclear norms of the solutions on each task. Notice that on tasks with many observations, the difference between the ground truth and the global minimizer of lowest nuclear norm is slight, and the linear neural networks converge to these. In contrast, on tasks with few observations the difference is significant, and the linear neural networks (especially the deeper one) choose the low rank ground truth over the global minimizer of lowest nuclear norm. For details see arora2019implicit, from which the results are taken.
Figure 3: Empirical demonstration of the greedy low rank learning process brought forth by the implicit regularization of linear neural networks. Plots show, for a matrix sensing task comprising measurements taken from a low rank ground truth, singular values of the learned solution throughout the iterations of gradient descent. Left ("linear") plot corresponds to direct minimization of the matrix sensing loss; middle ("2 layer LNN") and right ("3 layer LNN") plots correspond to minimization of the overparameterized objectives induced by two and three layer linear neural networks, respectively. Plot titles specify reconstruction error (distance of learned solution from ground truth matrix) at the end of training. Notice that the greedy low rank learning process takes place only with the linear neural networks, and is sharper with the deeper network. For details see arora2019implicit, from which results are taken.
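
The overparameterization described in the caption of Figure 1 (replacing a linear predictor with a two layer linear neural network and running gradient descent on the induced objective) can be sketched as follows. This is not the experimental setup of arora2018optimization: the synthetic data, the dimensions, the initialization scale, the step size, the 1/4 normalization of the loss, and all function names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic regression data: 50 samples, 5 features, scalar labels.
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5)


def l4_loss(w):
    """ell_4 regression loss of a linear predictor w with shape (5,)."""
    return np.mean((X @ w - y) ** 4) / 4


def l4_grad(w):
    """Gradient of l4_loss with respect to w."""
    return X.T @ ((X @ w - y) ** 3) / len(y)


def lnn_loss(W1, w2):
    """Overparameterized objective: l4_loss at the end-to-end vector W1.T @ w2."""
    return l4_loss(W1.T @ w2)


def lnn_grads(W1, w2):
    """Chain-rule gradients of lnn_loss with respect to W1 and w2."""
    g = l4_grad(W1.T @ w2)  # gradient with respect to the end-to-end vector
    return np.outer(w2, g), W1 @ g


def lnn_step(W1, w2, lr):
    """One gradient descent step on the overparameterized objective."""
    gW1, gw2 = lnn_grads(W1, w2)
    return W1 - lr * gW1, w2 - lr * gw2


# Small random initialization of a two layer network with 3 hidden units.
W1 = 0.1 * rng.standard_normal((3, 5))
w2 = 0.1 * rng.standard_normal(3)
W1_next, w2_next = lnn_step(W1, w2, lr=1e-3)
```

Both layers are updated from the same end-to-end gradient `g`, so the update induced on `W1.T @ w2` generally differs from a plain gradient descent step on `l4_loss`, even though the two models express the same family of linear predictors.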

Theorems & Definitions (31)

Proposition 1
proof
Lemma 1
proof
Definition 1
Proposition 2
proof
Definition 2
Theorem 1: end-to-end dynamics
proof
...and 21 more
