Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Hannah Day; Yonatan Kahn; Daniel A. Roberts

TL;DR

This work shows that orthogonal weight initializations in deep MLPs produce preactivation fluctuations that are independent of depth to leading order in inverse width, and NTK correlators that saturate at a depth of about 20, in contrast to Gaussian initializations, where fluctuations grow with depth. The authors derive kernel and 4-point vertex recursions and validate them empirically: for tanh (and linear) activations, finite-width feature learning remains robust in deep networks, while ReLU retains some depth-dependent growth. The study connects these fluctuation properties to training dynamics and generalization, with experiments on MNIST and CIFAR-10 indicating that orthogonal networks train faster and generalize better in the regime $L/n \sim 1$. These results offer a principled explanation for why orthogonal initializations can improve performance in deep networks and point to new directions for optimizing initialization schemes at finite width.

Abstract

Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed in deep networks with depth comparable to width. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.
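The contrast between the two ensembles can be illustrated with a small numerical experiment. The following is a minimal sketch, not the authors' code: it uses the normalized squared norm of the preactivations, $|z^{(\ell)}|^2/n$, as a simple proxy for the fluctuations discussed above (it is not the 4-point vertex $V$ itself), and it defaults to linear activations, for which square orthogonal weights preserve the norm exactly. The function names, the width $n = 32$, the depth 20, and the ensemble size are all illustrative assumptions.

```python
import numpy as np


def haar_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix from the Haar measure via QR."""
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # Fix the sign ambiguity of QR so the result is Haar-distributed.
    return q * np.sign(np.diag(r))


def gaussian(n, rng):
    """Sample an n x n matrix with i.i.d. N(0, 1/n) entries."""
    return rng.standard_normal((n, n)) / np.sqrt(n)


def norm_fluctuations(sampler, n=32, depth=20, n_nets=200,
                      activation=lambda z: z, seed=0):
    """Variance across an ensemble of |z^(l)|^2 / n at each layer l.

    Assumption: zero biases and a fixed input with |x|^2 / n = 1.
    """
    rng = np.random.default_rng(seed)
    x = np.ones(n)
    norms = np.empty((n_nets, depth))
    for i in range(n_nets):
        z = x.copy()
        for layer in range(depth):
            z = sampler(n, rng) @ activation(z)
            norms[i, layer] = z @ z / n
    return norms.var(axis=0)


if __name__ == "__main__":
    gauss_var = norm_fluctuations(gaussian)
    orth_var = norm_fluctuations(haar_orthogonal)
    print("Gaussian, first and last layer:", gauss_var[0], gauss_var[-1])
    print("Orthogonal, max over layers:   ", orth_var.max())
```

In this linear toy setting the orthogonal ensemble shows no fluctuations beyond floating-point error, while the Gaussian ensemble's variance increases with depth. Passing `activation=np.tanh` gives a rough look at the nonlinear case, although the quantity measured here is still only a proxy for the correlators analyzed in the paper.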

Paper Structure (21 sections, 75 equations, 20 figures)

Introduction
Notation and preliminaries
Exact preactivation distribution for linear orthogonal networks
Critical initialization and fluctuations for nonlinear orthogonal networks
Kernel recursion and critical initialization hyperparameters
4-point vertex recursion
Depth dependence of $V$ for various activations
NLO metric
Relation to previous work
Measurement of NTK statistics
NTK and learning rate definitions
Measurements of NTK, dNTK, and ddNTK statistics at initialization
Training and generalization experiments
Experimental setup
Results for training dynamics
...and 6 more sections

Figures (20)

Figure 1: Normalized single-input vertex $\widetilde{V}$ with $n=100$, in an ensemble of 1000 networks. Measured correlators are shown by dots, triangles, and stars; the theory lines are derived in Sec. \ref{sec:V} for orthogonal and in Ch. 5 of Ref. Roberts:2021fes for Gaussian. Left: Gaussian and orthogonal initializations, with linear, ReLU, and tanh activations. Right: orthogonal and mixed initializations for linear and tanh activations.
Figure 2: Measurements of normalized single-input correlators for Gaussian (left) and orthogonal (right) initializations, with $n = 50$ and tanh activations, in an ensemble of 100 networks. The Gaussian correlators (except for $\widetilde{U}$) grow proportional to $\pm \ell$ as predicted by the analysis of Ref. Roberts:2021fes, while the orthogonal correlators are smaller in overall magnitude and begin to saturate with depth.
Figure 3: Measurements of normalized single-input correlators for Gaussian (left) and orthogonal (right) initializations, with $n = 20$ and tanh activations, in an ensemble of 100 networks. The Gaussian correlators begin to grow exponentially in magnitude and fluctuate chaotically when $\ell \simeq n$, but the orthogonal correlators saturate at a depth $\ell \sim 20$. By $\ell = 30$, the Gaussian correlators are orders of magnitude larger than the orthogonal correlators.
Figure 4: MSE validation loss versus epoch for Gaussian (left) and orthogonal (right) initializations, with $n = 30$ and tanh activations, averaged over 10 runs on MNIST data. The Gaussian networks suffer from both slower training and worse generalization as the depth increases, while the loss curves for orthogonal networks begin to lie on top of one another for $L \gtrsim 20$.
Figure 5: MSE validation loss versus epoch for Gaussian (left) and orthogonal (right) initializations, with $n = 30$ and tanh activations, averaged over 5 runs on CIFAR-10 data. As with MNIST, the loss curves for Gaussian weights exhibit slower training and worse generalization at large depths compared to orthogonal initializations.
...and 15 more figures
