IDInit: A Universal and Stable Initialization Method for Neural Network Training

Yu Pan, Chaozheng Wang, Zekai Wu, Qifan Wang, Min Zhang, Zenglin Xu

TL;DR

IDInit introduces a universal, stable initialization by preserving identity across both main and sub-stem residual branches via a padded identity-like matrix, addressing non-square weight rank constraints. It combines identity-preserving initialization with a zero-preserving variant to mitigate dead neurons and employs a patch-maintain scheme to extend identity propagation to convolutions, further aided by a small-loosened identity to inject diversity. The approach yields faster convergence and higher accuracy across CIFAR-10, ImageNet, and NLP tasks, and even accelerates large-scale pretraining such as BERT-Base, demonstrating robustness to hyperparameters and architectural variations. The work provides theoretical and empirical support for dynamical isometry in IDInit and outlines practical guidelines for applying identity-based initialization to non-square and convolutional layers, suggesting broad applicability in modern deep learning pipelines.

Abstract

Deep neural networks have achieved remarkable accomplishments in practice. Their success hinges on effective initialization methods, which are vital for stable and rapid convergence during training. Recently, initialization methods that maintain identity transition within layers have shown good efficiency in network training. These techniques (e.g., Fixup) set specific weights to zero to achieve identity control. However, the setting of the remaining weights (e.g., Fixup initializes the non-zero weights with random values) affects the inductive bias that a zero weight alone achieves, which may be harmful to training. Addressing this concern, we introduce fully identical initialization (IDInit), a novel method that preserves identity in both the main and sub-stem layers of residual networks. IDInit employs a padded identity-like matrix to overcome rank constraints in non-square weight matrices. Furthermore, we show that the convergence problem of an identity matrix can be solved by stochastic gradient descent. Additionally, we enhance the universality of IDInit by processing higher-order weights and addressing dead neuron problems. IDInit is a straightforward yet effective initialization method that improves convergence, stability, and performance across various settings, including large-scale datasets and deep models.
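
To make the "padded identity-like matrix" concrete, here is a minimal NumPy sketch. It contrasts padding an identity with more identity against padding it with zeros, the two options the Figure 2 caption calls "IDInit" and "Partial Identity". The helper names `padded_identity` and `partial_identity` are hypothetical, and the tiling rule (entry $(i, j)$ equals $\tau$ when $i$ and $j$ agree modulo the smaller dimension) is an assumption made for illustration, not necessarily the paper's exact definition.

```python
import numpy as np


def padded_identity(rows: int, cols: int, tau: float = 1.0) -> np.ndarray:
    """Identity-like matrix for a possibly non-square weight (hypothetical helper).

    Assumption: entry (i, j) is tau when i and j agree modulo min(rows, cols),
    so the identity pattern is repeated along the longer axis.
    """
    m = min(rows, cols)
    i = np.arange(rows)[:, None] % m
    j = np.arange(cols)[None, :] % m
    return tau * (i == j).astype(np.float64)


def partial_identity(rows: int, cols: int, tau: float = 1.0) -> np.ndarray:
    """Identity padded with zeros, shown only for comparison."""
    w = np.zeros((rows, cols))
    m = min(rows, cols)
    w[np.arange(m), np.arange(m)] = tau
    return w


x = np.array([1.0, 2.0, 3.0, 4.0])
w_id = padded_identity(4, 6)     # 4 inputs mapped to 6 outputs
w_zero = partial_identity(4, 6)

y_id = x @ w_id      # every output unit carries a copy of some input
y_zero = x @ w_zero  # the two extra output units receive nothing
print(y_id)
print(y_zero)
```

With identity padding, the two extra output units repeat the first two inputs, whereas zero padding leaves them at zero. Both matrices reduce to the ordinary identity when the weight is square.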

Paper Structure

This paper contains 49 sections, 1 theorem, 34 equations, 22 figures, 10 tables.

Key Result

Theorem 3.1

If all weights $\{\theta^{(i)}\}_{i=0}^{2}$ are initialized by $\operatorname{IDI}_{1}$, the rank of $\Delta{\theta}^{(1)}$ can exceed $D_0$, which breaks the rank constraint.

Figures (22)

  • Figure 1: A case of identity-control initialization, which sets $W_2=\mathbf{0}$ to satisfy $Y=X$.
  • Figure 2: Effect of the initialization of $W_1$ when $W_2=\mathbf{0}$. The experiment uses CIFAR-10 and the blocks in Figure 1; more details are in the appendix. (a) The initialization methods for $W_1$ in a rectangular format. Fixup: "Random"; ZerO: "Hadamard". "Partial Identity" and "IDInit" denote padding an identity matrix with $\mathbf{0}$ and with $I$, respectively. (b) $W_1\in \mathbb{R}^{240\times 240}$ and $W_2\in \mathbb{R}^{240\times 240}$ are square matrices. "Identity-1" represents a configuration where only one weight is initialized as $\mathbf{0}$. Interestingly, while the "Random" and "Hadamard" methods may outperform "Identity-1" in the initial training epochs due to more network weights, they struggle to capture the inductive bias of "Identity-1", resulting in convergence difficulties. In contrast, IDInit can effectively leverage the training dynamics associated with "Identity-1". (c) $W_1\in \mathbb{R}^{280\times 240}$ and $W_2\in \mathbb{R}^{240\times 280}$ are rectangular matrices. "Default" means $W_1$ and $W_2$ are initialized with Xavier. However, "Default" proves ineffective for training, as it conflicts with dynamical isometry. Furthermore, even though "Partial Identity" can transmit partial signals, it performs poorly due to rank constraint issues. Finally, IDInit maintains good training conditions by padding with the identity matrix.
  • Figure 3: An overview of IDInit, which consists of identity-preserving initialization $\operatorname{IDI}_{\tau}$ and zero-preserving initialization $\operatorname{IDIZ}_{\varepsilon}$, whose dimensions are denoted $D^{\mathbf{I}}$ and $D^{\mathbf{0}}$. $\tau$ and $\varepsilon$ are usually set to 1 and 1e-6 to maintain identity and transit zero, respectively. $i$ and $i+1$ denote two adjacent layer indices.
  • Figure 4: Two padding schemes and their influence on the rank of a layer. We trained a 3-layer network on MNIST with $D_0=768$ and $D_h=2048$, and plot the rank of $\Delta\theta^{(1)}\in \mathbb{R}^{D_h \times D_h}$. Padding with identity can reach a rank above 768, like Hadamard, while padding with zero stays limited below 768. The loose condition can lead to better rank performance but cannot solve the rank constraint problem of padding with zero.
  • Figure 5: The last weight in a residual block of a trained ResNet. More than half of the elements in the first panel are not trained, which is known as the dead neuron problem. By contrast, $\operatorname{IDIZ}_{1e-6}$ successfully solves the dead neuron problem and makes all the elements in the second panel trainable.
  • ...and 17 more figures
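
The identity-control setting from the Figure 1 caption (setting $W_2=\mathbf{0}$ so that $Y=X$) can be illustrated with a short NumPy sketch. The block form $Y = X + W_2\,\mathrm{ReLU}(W_1 X)$, the ReLU itself, and the function name `residual_block` are assumptions made for illustration; the shapes follow the rectangular case in the Figure 2 caption. The point is only that the output equals the input at initialization regardless of how $W_1$ is chosen, which is why the choice of $W_1$ can be studied in isolation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden, batch = 240, 280, 8  # shapes from the rectangular case in Figure 2

# W1 is drawn randomly here; any other initialization would give the same output.
W1 = rng.standard_normal((d_hidden, d)) / np.sqrt(d)
# Identity control: the last weight of the residual branch starts at zero.
W2 = np.zeros((d, d_hidden))


def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Assumed block form: y = x + w2 @ relu(w1 @ x), column-vector convention."""
    hidden = np.maximum(w1 @ x, 0.0)
    return x + w2 @ hidden


X = rng.standard_normal((d, batch))
Y = residual_block(X, W1, W2)
print(np.array_equal(Y, X))
```

Because the residual branch contributes exactly zero, the block is an identity map at the start of training, while $W_1$ still shapes the hidden activations that the gradient of $W_2$ depends on.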

Theorems & Definitions (3)

  • Theorem 3.1
  • proof
  • proof