Improved weight initialization for deep and narrow feedforward neural network

Hyunwoo Lee; Yunho Kim; Seung Yeop Yang; Hayoung Choi

TL;DR

This paper tackles the dying ReLU problem in extremely deep and narrow FFNNs by introducing a fully deterministic weight initialization. The method constructs the weight matrix from a QR-based orthogonalization of a perturbed all-ones matrix, which yields provable properties such as orthogonality and balanced row and column sums. The authors prove these properties and demonstrate that the method works independently of depth, width, and activation function, using an algorithmic construction that scales to large networks. Empirical results on MNIST, Fashion-MNIST, and selected tabular datasets show improved convergence and higher validation accuracy in deep and narrow architectures compared with multiple baselines, indicating that the method is robust in practice and trains without batch normalization.
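The construction summarized above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' algorithm: this summary does not give the exact perturbation or how the orthogonal factors are combined, so the sketch assumes the perturbation is $\epsilon \mathbf{I}$ added to the all-ones matrix and that the $m \times n$ weight matrix is $\mathbf{Q}_{m\times m}^{\epsilon}\,\mathbf{I}_{m\times n}\,(\mathbf{Q}_{n\times n}^{\epsilon})^{\top}$. The function names `orthogonal_from_perturbed_ones` and `sketch_initial_weight` are hypothetical.

```python
import numpy as np


def orthogonal_from_perturbed_ones(n, eps=0.01):
    """Orthogonal factor Q from the QR decomposition of (ones + eps * I).

    ASSUMPTION: the perturbation is eps times the identity; the summary
    only says "a perturbed all-ones matrix".
    """
    a = np.ones((n, n)) + eps * np.eye(n)
    q, r = np.linalg.qr(a)
    # Flip column signs so that diag(R) is nonnegative. This removes the
    # sign ambiguity of QR and keeps the result deterministic.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs


def sketch_initial_weight(m, n, eps=0.01):
    """Hypothetical m x n weight matrix Q_m @ I_{m x n} @ Q_n.T.

    ASSUMPTION: this product form is a guess made for illustration.
    """
    q_m = orthogonal_from_perturbed_ones(m, eps)
    q_n = orthogonal_from_perturbed_ones(n, eps)
    k = min(m, n)
    # Multiplying by the rectangular identity keeps the first k columns.
    return q_m[:, :k] @ q_n[:, :k].T


if __name__ == "__main__":
    w = sketch_initial_weight(20, 40, eps=0.01)
    # Rows are orthonormal when m <= n.
    print(np.allclose(w @ w.T, np.eye(20)))
```

Because no random numbers are drawn, repeated calls with the same shape and `eps` return the same matrix, which is the sense in which such an initialization is deterministic.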

Abstract

Appropriate weight initialization settings, along with the ReLU activation function, have become cornerstones of modern deep learning, enabling the training and deployment of highly effective and efficient neural network models across diverse areas of artificial intelligence. The problem of "dying ReLU," where ReLU neurons become inactive and yield zero output, presents a significant challenge in the training of deep neural networks with the ReLU activation function. Theoretical research and various methods have been introduced to address the problem. However, even with these methods and research, training remains challenging for extremely deep and narrow feedforward networks with the ReLU activation function. In this paper, we propose a novel weight initialization method to address this issue. We establish several properties of our initial weight matrix and demonstrate how these properties enable the effective propagation of signal vectors. Through a series of experiments and comparisons with existing methods, we demonstrate the effectiveness of the novel initialization method.
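To make the dying ReLU problem concrete, the following sketch estimates how often a randomly initialized deep ReLU network outputs exactly zero at initialization. It is an illustration only and does not reproduce the paper's experiments: He-normal weights with zero biases are assumed as a common baseline, and the function name `fraction_dead_networks` is hypothetical.

```python
import numpy as np


def fraction_dead_networks(width, depth, trials=200, seed=0):
    """Fraction of random ReLU networks whose signal becomes all-zero.

    ASSUMPTIONS: square layers of size `width`, He-normal weights,
    zero biases, and a standard normal input vector.
    """
    rng = np.random.default_rng(seed)
    dead = 0
    for _ in range(trials):
        x = rng.standard_normal(width)
        for _ in range(depth):
            w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
            x = np.maximum(w @ x, 0.0)  # ReLU
            if not x.any():
                # Every neuron in this layer is inactive, so all later
                # layers also output zero.
                dead += 1
                break
    return dead / trials


if __name__ == "__main__":
    print("width 2, depth 50: ", fraction_dead_networks(2, 50))
    print("width 64, depth 50:", fraction_dead_networks(64, 50))
```

Under these assumptions, a layer with a nonzero input has all of its neurons inactive with probability $2^{-\text{width}}$, so the chance that the signal survives shrinks geometrically with depth in narrow networks, while wide networks are almost never affected.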

Paper Structure (14 sections, 7 theorems, 36 equations, 5 figures, 2 tables)

Introduction
Methodology
Basic Conceptions
Prior Work
Proposed Weight Initialization Method
Properties of the proposed initial weight matrix
Experimental results
Experimental Settings
Prior Weight Initialization Method for FFNNs
Experiments in Various Settings
Depth Independent
Width Independent
Activation Independent
Conclusion

Key Result

Proposition 1

Let $\mathbf{q}_1,\ldots,\mathbf{q}_m$ be the column vectors of $\mathbf{Q}_{m\times m}^{\epsilon}$ and $\hat{\mathbf{q}}_1,\ldots,\hat{\mathbf{q}}_n$ be the column vectors of $\mathbf{Q}_{n\times n}^{\epsilon}$. Then it holds that

Figures (5)

Figure 1: Heatmap of a proposed initial weight matrix $\mathbf{W}^{\epsilon}_{20\times 40}$ with $\epsilon=0.01$. The entries of $\mathbf{W}^{\epsilon}_{20\times 40}$ exhibit a distinct pattern.
Figure 2: Effectiveness of positive signal propagation for each weight matrix $\mathbf{W} \in \mathbb{R}^{200 \times 100}$. For 25 random vectors $\mathbf{x} \in \mathbb{R}^{100}$, the entries of $\mathbf{Wx}$ are plotted. The $x$-axis represents the indices of the entries.
Figure 3: Validation accuracy for FFNNs with ReLU activation across varying depths, to explore layer independence. (a) and (b) investigate networks where all hidden layers have the same dimension. (c), (d), and (e) investigate networks consisting of a layer with 10 nodes and a layer with 6 nodes, repeated throughout the structure.
Figure 4: Validation accuracy for FFNNs with two hidden layers and the ReLU activation function. The $y$-axis (resp. $x$-axis) represents the number of nodes in the first (resp. second) hidden layer. Each network is trained on the MNIST dataset for 10 epochs.
Figure 5: Validation accuracy for FFNNs with two hidden layers and the ReLU activation function. The $y$-axis (resp. $x$-axis) represents the number of nodes in the first (resp. second) hidden layer. Each network is trained on the FMNIST dataset for 1 epoch.

Theorems & Definitions (15)

Remark
Example 1
Example 2
Proposition 1
Theorem 1
proof
Lemma 1
Lemma 2
proof
Proposition 2
...and 5 more
