Convergence Analysis for Learning Orthonormal Deep Linear Neural Networks

Zhen Qin; Xuwei Tan; Zhihui Zhu

TL;DR

This work gives a local convergence analysis of Riemannian gradient descent (RGD) for orthonormal deep linear neural networks (ODLNNs), in which every layer except the first is constrained to be orthonormal. Using a whitened data model and a teacher ODLNN, the authors establish a linear convergence rate, based on a Riemannian regularity condition, under a restricted correlated gradient condition; depth affects the rate only polynomially. The RGD updates project gradients onto the tangent space of the Stiefel manifold and apply a polar retraction to enforce orthonormality, while balancing progress between the unconstrained first layer and the constrained remaining layers. Experiments on MNIST show that RGD outperforms standard gradient descent and that convergence behavior depends on depth, while the effect of nonlinear activations is more nuanced.
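To make the update on a constrained layer concrete, here is a minimal NumPy sketch of one Riemannian step on the Stiefel manifold: project the Euclidean gradient onto the tangent space, take a step, then apply a polar retraction. It assumes weight matrices with orthonormal columns ($\boldsymbol{W}^\top\boldsymbol{W} = \boldsymbol{I}$) and the standard embedded-metric projection; the function names `tangent_project`, `polar_retraction`, and `stiefel_step` are illustrative and not taken from the paper.

```python
import numpy as np

def tangent_project(W, G):
    """Project a Euclidean gradient G onto the tangent space at W (assumes W.T @ W = I)."""
    WtG = W.T @ G
    sym = 0.5 * (WtG + WtG.T)          # symmetric part of W^T G
    return G - W @ sym

def polar_retraction(A):
    """Return the orthonormal polar factor of a full-column-rank matrix A."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt                      # equals A (A^T A)^{-1/2}

def stiefel_step(W, G, step):
    """One Riemannian gradient step: tangent projection followed by polar retraction."""
    return polar_retraction(W - step * tangent_project(W, G))

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # random point with orthonormal columns
G = rng.standard_normal((6, 3))                    # stand-in for a Euclidean gradient
W_next = stiefel_step(W, G, step=0.1)
xi = tangent_project(W, G)
```

After the step, `W_next` again has orthonormal columns, and the projected direction `xi` satisfies the tangent-space condition that $\boldsymbol{W}^\top\boldsymbol{\xi}$ is skew-symmetric.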

Abstract

Enforcing orthonormality or isometry on the weight matrices has been shown to improve the training of deep neural networks by mitigating exploding and vanishing gradients and increasing the robustness of the learned networks. Despite this practical success, theoretical analysis of orthonormality in neural networks is still lacking; for example, it is unclear how orthonormality affects the convergence of training. In this letter, we aim to bridge this gap by providing a convergence analysis for training orthonormal deep linear neural networks. Specifically, we show that Riemannian gradient descent with an appropriate initialization converges at a linear rate for training orthonormal deep linear neural networks with a class of loss functions. Unlike existing works that enforce orthonormal weight matrices for all layers, our approach drops this requirement for one layer, which is crucial for establishing the convergence guarantee. Our results shed light on how increasing the number of hidden layers affects the convergence speed. Experimental results validate our theoretical analysis.
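The setting described in the abstract can be illustrated with a small synthetic experiment. The sketch below trains a three-layer linear network against a teacher network on whitened inputs, keeping the first layer unconstrained and the other layers orthonormal. Everything specific here is an assumption made for illustration: the squared loss, the layer sizes, the perturbation scale of the initialization, and the names `mu` (step size) and `gamma` (a hypothetical multiplier on the first-layer step). It is not the authors' experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stiefel(rows, cols):
    """Random matrix with orthonormal columns."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

def tangent_project(W, G):
    WtG = W.T @ G
    return G - W @ (0.5 * (WtG + WtG.T))

def polar_retraction(A):
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def chain(mats, size):
    """Product mats[-1] @ ... @ mats[0]; identity of the given size if mats is empty."""
    P = np.eye(size)
    for M in mats:
        P = M @ P
    return P

dims = [4, 5, 6, 7]                      # assumed sizes d_0 = d_x, d_1, d_2, d_3
X = random_stiefel(20, dims[0]).T        # whitened input: X @ X.T = I
teacher = [random_stiefel(5, 4) @ np.diag([1.5, 1.2, 1.0, 0.8]),  # unconstrained first layer
           random_stiefel(6, 5), random_stiefel(7, 6)]            # orthonormal layers
Y_star = chain(teacher, dims[0]) @ X

# Initialization near the teacher (assumed perturbation scale 0.05).
W = [teacher[0] + 0.05 * rng.standard_normal(teacher[0].shape)]
W += [polar_retraction(T + 0.05 * rng.standard_normal(T.shape)) for T in teacher[1:]]

def loss(W):
    return 0.5 * np.linalg.norm(chain(W, dims[0]) @ X - Y_star) ** 2

mu, gamma = 0.1, 1.0                     # hypothetical step size and first-layer multiplier
losses = [loss(W)]
for _ in range(300):
    R = (chain(W, dims[0]) @ X - Y_star) @ X.T        # residual times X^T
    new_W = []
    for i in range(len(W)):
        below = chain(W[:i], dims[0])                 # layers before layer i
        above = chain(W[i + 1:], dims[i + 1])         # layers after layer i
        G = above.T @ R @ below.T                     # Euclidean gradient w.r.t. W[i]
        if i == 0:
            new_W.append(W[0] - gamma * mu * G)       # plain gradient step
        else:
            new_W.append(polar_retraction(W[i] - mu * tangent_project(W[i], G)))
    W = new_W
    losses.append(loss(W))
```

Plotting `losses` on a logarithmic axis is one way to inspect the decay; the constrained layers stay orthonormal throughout because each update ends with a retraction.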

Paper Structure (5 sections, 3 theorems, 17 equations, 1 figure)

This paper contains 5 sections, 3 theorems, 17 equations, 1 figure.

Introduction
Riemannian Gradient Descent for Orthonormal Deep Linear Neural Networks
Convergence Analysis
Experiments
Conclusion

Key Result

Lemma 1

Assume a whitened input $\boldsymbol{X}\in\mathbb{R}^{d_x\times n}$, i.e., $\boldsymbol{X}\boldsymbol{X}^\top = \boldsymbol{I}_{d_x}$. Let $\boldsymbol{Y} = \boldsymbol{W}_N \cdots \boldsymbol{W}_1 \boldsymbol{X}$ and $\boldsymbol{Y}^\star = \boldsymbol{W}_N^\star\cdots \boldsymbol{W}_1^\star\,\ldots$

Figures (1)

Figure 1: Convergence analysis for GD($N$, $\mu$) and RGD($N$, $\mu$, $\gamma$) with different activation functions and $N$.

Theorems & Definitions (5)

Definition 1: Data model
Lemma 1
Definition 2
Lemma 2
Theorem 1
