Linearly Constrained Weights: Reducing Activation Shift for Faster Training of Neural Networks

Takuro Kutsuna

TL;DR

Experimental results show that linearly constrained weights (LCW) enable a deep feedforward network with sigmoid activation functions to be trained efficiently by resolving the vanishing gradient problem. Combined with batch normalization, LCW also improves the generalization performance of both feedforward and convolutional networks.

Abstract

In this paper, we first identify activation shift, a simple but remarkable phenomenon in a neural network in which the preactivation value of a neuron has non-zero mean that depends on the angle between the weight vector of the neuron and the mean of the activation vector in the previous layer. We then propose linearly constrained weights (LCW) to reduce the activation shift in both fully connected and convolutional layers. The impact of reducing the activation shift in a neural network is studied from the perspective of how the variance of variables in the network changes through layer operations in both forward and backward chains. We also discuss its relationship to the vanishing gradient problem. Experimental results show that LCW enables a deep feedforward network with sigmoid activation functions to be trained efficiently by resolving the vanishing gradient problem. Moreover, combined with batch normalization, LCW improves generalization performance of both feedforward and convolutional networks.
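
The abstract does not spell out the linear constraint itself. As a rough illustration only, the sketch below assumes the constraint is that each weight vector sums to zero, which makes it orthogonal to any mean activation vector whose components are all equal. The helper `project_zero_sum`, the layer sizes, and the uniform sampling are hypothetical choices for this sketch. The projection is used purely for demonstration and does not reproduce the reparameterization-based learning listed in the paper's outline.

```python
import numpy as np

def project_zero_sum(W):
    """Subtract each row's mean so every weight vector sums to zero.

    Assumed form of the linear constraint; not taken from this page.
    """
    return W - W.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(20, 100))   # 20 neurons, 100 inputs (hypothetical sizes)
A = rng.uniform(0.0, 1.0, size=(100, 500))   # all-positive activations, e.g. after a sigmoid

# Average absolute per-neuron mean of the preactivation, with and without the constraint.
shift_plain = np.abs((W @ A).mean(axis=1)).mean()
shift_lcw = np.abs((project_zero_sum(W) @ A).mean(axis=1)).mean()

print(f"mean |preactivation mean| with unconstrained weights: {shift_plain:.3f}")
print(f"mean |preactivation mean| with zero-sum weights:      {shift_lcw:.3f}")
```

Under these assumptions the per-neuron preactivation means should shrink toward zero once the rows are projected, which is the effect the abstract calls reducing the activation shift.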

Paper Structure (17 sections, 6 theorems, 10 equations, 6 figures, 1 table)

Introduction
Activation Shift
Linearly Constrained Weights
Learning LCW via Reparameterization
LCW for Convolutional Layers
Variance Analysis
Variance Analysis of a Fully Connected Layer
Variance Analysis of a Nonlinear Activation Layer
Relationship to the Vanishing Gradient Problem
Example
Related work
Experiments
Deep MLP with Sigmoid Activation Functions
Deep Convolutional Networks with ReLU Activation Functions
Conclusion
...and 2 more sections

Key Result

Proposition 1

Assume that the activation vector $\bm{a}^{l-1}$ follows $\mathcal{P}_{\gamma}$. Given a weight vector $\bm{w}_i^l \in \mathbb{R}^m$ such that $\|\bm{w}_i^l\| > 0$, the expected value of $\bm{w}_i^l \cdot \bm{a}^{l-1}$ is $|\gamma| \sqrt{m} \|\bm{w}_i^l\| \cos \theta_i^l$, where $\theta_i^l$ is the
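
The closed form in Proposition 1 can be checked numerically. The sketch below makes two assumptions that this page does not state: that $\mathcal{P}_{\gamma}$ means every component of the activation vector has mean $\gamma$, and that $\theta_i^l$ is the angle between the weight vector and the mean activation vector $\gamma \mathbf{1}_m$ (the reading suggested by the abstract). The uniform sampling and the function name `expected_preactivation` are hypothetical.

```python
import numpy as np

def expected_preactivation(w, gamma):
    """Closed form from Proposition 1: |gamma| * sqrt(m) * ||w|| * cos(theta).

    theta is taken as the angle between w and gamma * 1_m (an assumption).
    """
    m = w.size
    mean_vec = gamma * np.ones(m)  # assumed mean of the activation vector
    cos_theta = (w @ mean_vec) / (np.linalg.norm(w) * np.linalg.norm(mean_vec))
    return abs(gamma) * np.sqrt(m) * np.linalg.norm(w) * cos_theta

rng = np.random.default_rng(0)
m, gamma, n_samples = 64, 0.5, 50_000
w = rng.uniform(-1.0, 1.0, size=m)

# Hypothetical instance of P_gamma: i.i.d. uniform(0, 1) entries, each with mean 0.5.
a = rng.uniform(0.0, 1.0, size=(n_samples, m))

empirical = (a @ w).mean()                    # Monte Carlo estimate of E[w . a]
predicted = expected_preactivation(w, gamma)  # value given by the proposition

print(f"empirical mean of w . a: {empirical:.4f}")
print(f"closed-form prediction:  {predicted:.4f}")
```

With this reading of $\theta_i^l$ the expression reduces to $\gamma \sum_j w_{ij}^l$, so the expected preactivation vanishes exactly when the weight vector is orthogonal to the mean activation vector.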

Figures (6)

Figure 1: Activation shift causes a horizontal stripe pattern in preactivation $\bm{Z}^l = \bm{W}^l \bm{A}^{l-1}$, in which each element of $\bm{W}^l$ and $\bm{A}^{l-1}$ is randomly generated from the range $(-1,1)$ and $(0,1)$, respectively.
Figure 2: Boxplot summaries of $a_i^l$ on the first 20 neurons in layers 1, 5, and 9 of the 10-layer sigmoid MLP with LCW.
Figure 3: Boxplot summaries of $a_i^l$ on neurons in layers 1, 5, and 9 of the 10-layer sigmoid MLP without LCW, in which weights are initialized by the method of Glorot and Bengio (2010).
Figure 4: Boxplot summaries of the preactivation (top) and its gradient (bottom) in 20-layered sigmoid MLPs with standard weights (a) and LCWs (b).
Figure 5: Training loss (upper left), test loss (upper right), training accuracy (lower left), and test accuracy (lower right) of 50-layer MLPs for CIFAR-10 (a) and CIFAR-100 (b).
...and 1 more figures
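
The setup in the Figure 1 caption is simple enough to reproduce numerically. The sketch below draws $\bm{W}^l$ from $(-1,1)$ and $\bm{A}^{l-1}$ from $(0,1)$ as the caption describes. The matrix sizes and the use of uniform sampling are assumptions, since the caption does not give them. Rather than plotting, it compares each row's mean with the value expected from the row's weight sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, n_samples = 20, 100, 200   # hypothetical sizes, not taken from the paper

W = rng.uniform(-1.0, 1.0, size=(n_out, n_in))      # weights drawn from (-1, 1)
A = rng.uniform(0.0, 1.0, size=(n_in, n_samples))   # activations drawn from (0, 1)
Z = W @ A                                           # preactivations, one row per neuron

row_means = Z.mean(axis=1)              # per-neuron mean over the samples
expected_means = 0.5 * W.sum(axis=1)    # 0.5 is the mean of a uniform(0, 1) entry

for i in range(5):
    print(f"neuron {i}: row mean {row_means[i]:+.3f}, expected {expected_means[i]:+.3f}")
print(f"spread of row means across neurons: {row_means.std():.3f}")
```

Each row of `Z` sits around its own offset, set by the sum of that neuron's weights, so an image of `Z` shows horizontal stripes rather than uniform noise.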

Theorems & Definitions (16)

Definition 1
Proposition 1
Definition 2
Proposition 2
Definition 3
Proposition 3
Proposition 4
Proposition 5
Proposition 6
Proof of Proposition 1
...and 6 more
