Deep Network Trainability via Persistent Subspace Orthogonality

Alex Massucco, Davide Murari, Carola-Bibiane Schönlieb

TL;DR

This work tackles vanishing/exploding gradients in deep networks by studying the network Jacobian and establishing both exact Jacobian orthogonality and a relaxed persistent subspace orthogonality (PSO) framework. The authors provide a constructive characterization: exact a.e. Jacobian orthogonality can be achieved with specific piecewise-affine layers (e.g., Modified ResNet and modified feedforward forms) and then generalize to PSO, which requires isometry on a nontrivial subspace shared across depth. They propose two practical initialization strategies to enforce PSO and validate them with extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10, showing substantially improved trainability for very deep feedforward and residual networks, with robustness to subspace dimension and activation choices. The findings offer a principled approach to stabilize gradient propagation in deep architectures and suggest extensions to Transformers and graph neural networks as future directions.

Abstract

Training neural networks via backpropagation is often hindered by vanishing or exploding gradients. In this work, we design architectures that mitigate these issues by analyzing and controlling the network Jacobian. We first provide a unified characterization of a class of networks with orthogonal Jacobians, which includes known architectures and yields new trainable designs. We then introduce the relaxed notion of persistent subspace orthogonality, which applies to a broader class of networks whose Jacobians are isometries only on a non-trivial subspace. We propose practical mechanisms to enforce this condition and empirically show that it is necessary to sufficiently preserve the gradient norms during backpropagation, enabling the training of very deep networks. We support our theory with extensive experiments.
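The effect described in the abstract can be illustrated with a toy linear model. The sketch below is not the authors' construction: it assumes each layer Jacobian is a plain matrix, and the names `random_orthogonal`, `subspace_isometry`, and `backprop_norm`, as well as the sizes `n`, `k`, `depth` and the contraction factor `0.5`, are hypothetical choices made for illustration. It compares how the norm of a backpropagated vector evolves when the layer Jacobians are (a) exactly orthogonal, (b) isometries only on a shared `k`-dimensional subspace, and (c) unstructured Gaussian matrices.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Return an n x n orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal


def backprop_norm(jacobians, g):
    """Apply the transposed layer Jacobians from the last layer to the first."""
    for J in reversed(jacobians):
        g = J.T @ g
    return np.linalg.norm(g)


rng = np.random.default_rng(0)
n, k, depth = 32, 8, 100  # hypothetical width, subspace dimension, and depth

# (a) Exactly orthogonal layer Jacobians.
orth = [random_orthogonal(n, rng) for _ in range(depth)]

# (b) Jacobians that are isometries only on a subspace V shared by all layers.
basis = random_orthogonal(n, rng)
V, V_perp = basis[:, :k], basis[:, k:]


def subspace_isometry():
    # Rotation inside V, contraction by 0.5 on the orthogonal complement (assumed factor).
    R = random_orthogonal(k, rng)
    return V @ R @ V.T + 0.5 * (V_perp @ V_perp.T)


partial = [subspace_isometry() for _ in range(depth)]

# (c) Unstructured Gaussian Jacobians, scaled so each layer contracts on average.
gauss = [0.5 * rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]

g = rng.standard_normal(n)
g /= np.linalg.norm(g)                # unit-norm gradient at the output
g_in_V_norm = np.linalg.norm(V.T @ g)  # norm of the component of g inside V

norm_orth = backprop_norm(orth, g)
norm_partial = backprop_norm(partial, g)
norm_gauss = backprop_norm(gauss, g)

print(f"orthogonal:        {norm_orth:.3e}")
print(f"subspace isometry: {norm_partial:.3e} (component in V: {g_in_V_norm:.3e})")
print(f"gaussian:          {norm_gauss:.3e}")
```

In this toy setting the orthogonal case keeps the norm at one, the subspace-isometric case retains exactly the part of the gradient that lies in the shared subspace, and the unstructured case collapses toward zero. The real networks in the paper are nonlinear, so this only conveys the intuition behind the relaxed condition.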

Paper Structure

This paper contains 36 sections, 10 theorems, 78 equations, 5 figures, 4 tables.

Key Result

LEMMA 3.4

Let $\{\Omega_i\subseteq\mathbb{R}^n|\,i=1,\cdots,N\}$ be a collection of sets in $\mathcal{P}(\Omega)$, with $\Omega\subseteq\mathbb{R}^n$ connected. Consider a vector field $F:\mathbb{R}^n\to\mathbb{R}^n$ whose restriction to $\Omega_i$, which we call $F_i:\Omega_i\to\mathbb{R}^n$, is piecewise $C

Figures (5)

  • Figure 1: Test accuracy over 5 runs for a feedforward $\mathrm{ReLU}$-network (type \ref{eq:type_A}) and a residual $\mathrm{ReLU}$-network (type \ref{eq:type_B}) with different initializations. We consider $L\in\{5,20,50,100,200\}$ layers. The plotted results are the mean accuracies, and the confidence interval is determined from the minimum and maximum absolute deviations from the mean.
  • Figure 2: Plot of two piecewise affine Lipschitz continuous scalar functions.
  • Figure 3: Sensitivity to the dimension $k$ of the non-isometric subspace. MNIST results for projection and block variants.
  • Figure 4: Ablation over the initialization of the first and last affine layers. MNIST results for projection and block variants for non-orthogonal dimensionality reduction layers.
  • Figure 5: Ablation over different learning rates. Comparison of the test accuracies of four feedforward networks with weights initialized as Kaiming normal and as shared orthogonal weights. We consider two depths, $L \in \{5, 50\}$, and fixed learning rates in $\{10^{-6},10^{-5},10^{-4}\}$. All the models are trained for $20$ epochs.

Theorems & Definitions (27)

  • DEFINITION 3.1
  • DEFINITION 3.2: Piecewise $C^1$ map
  • DEFINITION 3.3: Piecewise affine
  • LEMMA 3.4
  • DEFINITION 3.5
  • THEOREM 3.6
  • REMARK 3.7
  • COROLLARY 3.8: Corollary to Theorem \ref{thm:char_RtoR_fields}
  • THEOREM 3.9
  • DEFINITION 4.1: Subspace orthogonality
  • ...and 17 more