Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks

Zhiwei Bai; Tao Luo; Zhi-Qin John Xu; Yaoyu Zhang

TL;DR

This work addresses how depth shapes neural network loss landscapes by proving an embedding principle in depth. It introduces a critical lifting operator that maps a shallow network's critical points to lifted critical manifolds in a deeper network while preserving outputs on the training set. Key contributions include the depth embedding theorem, preservation of network outputs and Hessian inertia, a proof that lifted critical manifolds shrink as training data increases, evidence that batch normalization (BN) helps training avoid lifted critical manifolds, and a practical layer-pruning technique. The findings reveal a depth-wise hierarchical structure of loss landscapes and offer practical guidance for training, regularization, and compression of deep models.

Abstract

Understanding the relation between deep and shallow neural networks is extremely important for the theoretical study of deep learning. In this work, we discover an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes of shallower NNs. The key tool for our discovery is the critical lifting operator proposed in this work, which maps any critical point of a network to critical manifolds of any deeper network while preserving the outputs. This principle provides new insights into many widely observed behaviors of DNNs. Regarding the easy training of deep networks, we show that a local minimum of an NN can be lifted to strict saddle points of a deeper NN. Regarding the acceleration effect of batch normalization, we demonstrate that batch normalization helps avoid the critical manifolds lifted from shallower NNs by suppressing layer linearization. We also prove that increasing training data shrinks the lifted critical manifolds, which can accelerate training, as demonstrated in experiments. Overall, our discovery of the embedding principle in depth uncovers the depth-wise hierarchical structure of the deep learning loss landscape, which serves as a solid foundation for further study of the role of depth in DNNs.
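
To make the output-preserving idea concrete, the following worked equation sketches why an inserted layer can leave the network function unchanged on the training data. It is an illustration under stated assumptions, not the paper's definition: we assume that on every training input each neuron of the inserted layer $\hat{q}$ stays on a single affine piece of the activation, $\sigma(z) = \lambda z + \mu$, with one slope $\lambda \neq 0$ and intercept $\mu$ shared by all neurons. The symbols $\lambda$, $\mu$, the bias vectors $\bm{b}$, and the feature vector $\bm{f}^{[q]}$ are notation introduced here for the sketch.

```latex
\begin{align*}
\bm{f}^{\prime[\hat{q}]}(\bm{x})
  &= \sigma\bigl(\bm{W}^{\prime[\hat{q}]}\bm{f}^{[q]}(\bm{x}) + \bm{b}^{\prime[\hat{q}]}\bigr)
   = \lambda\bigl(\bm{W}^{\prime[\hat{q}]}\bm{f}^{[q]}(\bm{x}) + \bm{b}^{\prime[\hat{q}]}\bigr) + \mu\bm{1}, \\
\bm{W}^{\prime[q+1]}\bm{f}^{\prime[\hat{q}]}(\bm{x}) + \bm{b}^{\prime[q+1]}
  &= \lambda\,\bm{W}^{\prime[q+1]}\bm{W}^{\prime[\hat{q}]}\bm{f}^{[q]}(\bm{x})
   + \bm{W}^{\prime[q+1]}\bigl(\lambda\bm{b}^{\prime[\hat{q}]} + \mu\bm{1}\bigr)
   + \bm{b}^{\prime[q+1]}, \\
\intertext{which equals the shallow pre-activation
  $\bm{W}^{[q+1]}\bm{f}^{[q]}(\bm{x}) + \bm{b}^{[q+1]}$ whenever}
\lambda\,\bm{W}^{\prime[q+1]}\bm{W}^{\prime[\hat{q}]} &= \bm{W}^{[q+1]},
\qquad
\bm{W}^{\prime[q+1]}\bigl(\lambda\bm{b}^{\prime[\hat{q}]} + \mu\bm{1}\bigr) + \bm{b}^{\prime[q+1]} = \bm{b}^{[q+1]}.
\end{align*}
```

For ReLU restricted to the positive half-line, $\lambda = 1$ and $\mu = 0$, so the first condition is a plain factorization of $\bm{W}^{[q+1]}$. Many factorizations satisfy these conditions, which is consistent with a single critical point lifting to a manifold rather than to a single point.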

Paper Structure (35 sections, 18 theorems, 48 equations, 12 figures)

Introduction
Related works
Preliminaries
Deep neural networks.
Loss function.
Back propagation.
Theory of Embedding Principle in Depth
Lifting operator
Embedding Principle in Depth
Numerical experiments
Experimental setup
Measuring layer linearization by Minimal Pearson Correlation (MPC).
Training dynamics of deep and shallow neural networks
Deep neural networks encounter lifted critical points during practical training
Incremental degeneracy of critical points through embedding
...and 20 more sections

Key Result

Lemma 4.1

(See the appendix for the proof.) Given data $S$, an $\mathrm{NN}\bigl(\{m_{l}\}_{l=0}^{L}\bigr)$ and its one-layer deeper counterpart $\mathrm{NN}^{\prime}\bigl(\{m_l^\prime\},\ l\in \{0, 1, 2, \cdots, q, \hat{q}, q+1, \cdots, L\}\bigr)$, the one-layer lifting $\mathcal{T}_{\ldots}$ ...

Figures (12)

Figure 1: The training dynamics of networks of different depths exhibit similarity. (a, c) The training loss for NNs of varying depths on the Iris and MNIST datasets, respectively. (b, d) The corresponding training accuracy for NNs of varying depths on the Iris and MNIST datasets, respectively. The color-coded areas indicate periods of slow change in training loss or training accuracy, indicating a possible encounter with a saddle point.
Figure 2: Illustration of one-layer lifting. The pink layer is inserted into the left network to get the right network. The input parameters $\bm{W}^{\prime[\hat{q}]}$ and output parameters $\bm{W}^{\prime[q+1]}$ of the inserted layer are obtained by factorizing the input parameters $\bm{W}^{[ q+1]}$ of $(q+1)$-th layer in the left network to satisfy layer linearization and output preserving conditions.
Figure 3: Deep neural networks encounter lifted critical points during training on synthetic data. (a) The training loss for single-hidden-layer and three-hidden-layer NNs with width $m=50$. (b) The outputs of NNs with different depths at the same loss value indicated by the colored span in (a). (c) The extent of layer linearization for different hidden layers during the training process of the three-hidden-layer NN. (d) Training loss trajectory of the reduced single-hidden-layer NN. The green dot in (a) and (c) is selected as a representative for comparison.
Figure 4: Deep neural networks encounter lifted critical points during training on Iris data. (a) The training loss for single-hidden-layer and three-hidden-layer NNs with width $m=50$. (b) The training accuracy of NNs with different depths at the same loss value indicated by the colored span in (a). The accuracy plateau is at 66.7% for both train and test sets. (c) The extent of layer linearization for different hidden layers during the training process of the three-hidden-layer NN. (d) Training loss trajectory of the reduced single-hidden-layer NN. The green dot in (a) and (c) is selected as a representative for comparison.
Figure 5: Incremental degeneracy of critical points through embedding. (a, b) The eigenvalues of the Hessian of ReLU NNs at the critical points embedded from the single-hidden-layer NN for learning the data in Fig. 3(b) and the Iris data, respectively. The results for each plot are averaged over 100 random orthogonal similarity transformations. The auxiliary dashed lines in (a, b) delineate the empirical boundary between zero and non-zero eigenvalues. We perform the embedding operation by factorizing one hidden layer into $k$ hidden layers ($k=2, 3$), whose input weights are identity and whose biases are selected to translate the input range into the affine subdomain.
...and 7 more figures
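
The construction described in the Figure 5 caption (identity input weights, biases that translate the input range into the affine subdomain) can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not the authors' code: for simplicity the extra ReLU layer is inserted directly after the input rather than by factorizing a hidden layer, and the names `lift_at_input`, `forward`, and `margin` are hypothetical. The check at the end confirms only that outputs agree on the training inputs; it does not verify criticality or Hessian properties.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, b, v):
    """Return W v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def forward(layers, x):
    """Apply ReLU after every layer except the last, which stays linear."""
    for W, b in layers[:-1]:
        x = relu(affine(W, b, x))
    W, b = layers[-1]
    return affine(W, b, x)

def lift_at_input(layers, data, margin=1.0):
    """Insert an identity-weight ReLU layer in front of the first layer.

    The inserted bias c shifts every training input into the region where
    ReLU acts as the identity; the next layer's bias is adjusted to undo
    the shift. (Assumption of this sketch: insertion happens at the input.)
    """
    d = len(data[0])
    # Choose c_i so that x_i + c_i >= margin > 0 for every training input.
    c = [margin - min(x[i] for x in data) for i in range(d)]
    identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    W1, b1 = layers[0]
    # W1 (x + c) + b1_new = W1 x + b1  requires  b1_new = b1 - W1 c.
    b1_new = [bi - sum(w * ci for w, ci in zip(row, c))
              for row, bi in zip(W1, b1)]
    return [(identity, c), (W1, b1_new)] + layers[1:]

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def rand_vector(n):
    return [rng.uniform(-1, 1) for _ in range(n)]

# Toy data and a one-hidden-layer network (3 inputs, 4 hidden units, 2 outputs).
data = [[rng.uniform(-2, 2) for _ in range(3)] for _ in range(20)]
shallow = [(rand_matrix(4, 3), rand_vector(4)),
           (rand_matrix(2, 4), rand_vector(2))]
deep = lift_at_input(shallow, data)

# Largest output discrepancy between the two networks on the training inputs.
max_gap = max(abs(a - b)
              for x in data
              for a, b in zip(forward(shallow, x), forward(deep, x)))
print(len(shallow), len(deep), max_gap < 1e-9)
```

Because the bias `c` is computed from the training inputs, the two networks are guaranteed to agree only on those inputs. An input far outside the training range can leave the region where the inserted ReLU is linear, in which case the deeper network may produce a different output.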

Theorems & Definitions (41)

Definition 3.1: deeper/shallower
Remark 3.1
Remark 4.1
Definition 4.1: affine subdomain
Definition 4.2: one-layer lifting
Remark 4.2
Remark 4.3
Lemma 4.1: existence of one-layer lifting
Lemma 4.2: computation of feature vectors, feature gradients and error vectors
Proposition 4.1: network properties preserving
...and 31 more
