A Lightweight and Gradient-Stable Neural Layer

Yueyao Yu, Yin Zhang

TL;DR

A neural-layer architecture based on Householder weighting and absolute-value activation, called the Householder-absolute neural layer or simply Han-layer, is proposed; it reduces the number of parameters and the corresponding computational complexity from $O(d^2)$ to $O(d)$.

Abstract

To enhance the resource efficiency and deployability of neural networks, we propose a neural-layer architecture based on Householder weighting and absolute-value activation, called the Householder-absolute neural layer or simply Han-layer. Compared to a fully connected layer with $d$ neurons and $d$ outputs, a Han-layer reduces the number of parameters and the corresponding computational complexity from $O(d^2)$ to $O(d)$. The Han-layer structure guarantees that the Jacobian of the layer function is always orthogonal, thus ensuring gradient stability (i.e., freedom from vanishing or exploding gradients) for any Han-layer sub-network. Extensive numerical experiments show that one can strategically use Han-layers to replace fully connected (FC) layers, reducing the number of model parameters while maintaining or even improving generalization performance. We also showcase the capabilities of the Han-layer architecture on a few small stylized models and discuss its current limitations.
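The two claims in the abstract, $O(d)$ cost and an orthogonal Jacobian, can be illustrated with a minimal sketch. It assumes the layer has the form $x \mapsto |H(u)x + b|$ with $H(u) = I - 2uu^T/(u^Tu)$ for a nonzero $d$-vector $u$. The bias term $b$, its sign, the function names, and the choice of sign $+1$ at zero are assumptions made for illustration and are not taken from the paper. The dense Jacobian is built only to check orthogonality; the forward pass itself never forms a $d \times d$ matrix.

```python
import math
import random


def householder_apply(u, x):
    """Return H(u) x, where H(u) = I - 2 u u^T / (u^T u), in O(d) time."""
    uu = sum(ui * ui for ui in u)
    if uu == 0.0:
        raise ValueError("u must be a nonzero vector")
    # One inner product and one scaled subtraction: no d x d matrix is formed.
    scale = 2.0 * sum(ui * xi for ui, xi in zip(u, x)) / uu
    return [xi - scale * ui for ui, xi in zip(u, x)]


def han_layer(u, b, x):
    """Assumed layer form: elementwise |H(u) x + b| (bias convention is hypothetical)."""
    return [abs(zi + bi) for zi, bi in zip(householder_apply(u, x), b)]


def han_jacobian(u, b, x):
    """Dense Jacobian diag(sign(z)) H(u), built in O(d^2) only for verification."""
    d = len(x)
    uu = sum(ui * ui for ui in u)
    z = [zi + bi for zi, bi in zip(householder_apply(u, x), b)]
    # Convention (assumption): use sign +1 where z is exactly zero.
    signs = [1.0 if zi >= 0.0 else -1.0 for zi in z]
    return [
        [signs[i] * ((1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / uu) for j in range(d)]
        for i in range(d)
    ]


def orthogonality_error(J):
    """Largest absolute entry of J^T J - I."""
    d = len(J)
    err = 0.0
    for i in range(d):
        for j in range(d):
            dot = sum(J[k][i] * J[k][j] for k in range(d))
            err = max(err, abs(dot - (1.0 if i == j else 0.0)))
    return err


rng = random.Random(0)
d = 8
u = [rng.gauss(0.0, 1.0) for _ in range(d)]
b = [rng.gauss(0.0, 1.0) for _ in range(d)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

y = han_layer(u, b, x)
J = han_jacobian(u, b, x)
print("layer output:", y)
print("max |J^T J - I|:", orthogonality_error(J))
```

Under these assumptions the layer stores only $u$ and $b$, that is $2d$ numbers, and the Jacobian is a diagonal matrix of $\pm 1$ entries times a Householder reflection, so it is orthogonal wherever the absolute value is differentiable.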

Paper Structure

This paper contains 28 sections, 5 theorems, 17 equations, 12 figures, 14 tables.

Key Result

Lemma 1

Let $\phi: \mathbb{R}\rightarrow\mathbb{R}$ have Property A and $|\mathcal{C}_{\phi}|$ denote the cardinality of $\mathcal{C}_{\phi}$. Then $|\mathcal{C}_{\phi}| \ge 1$.

Figures (12)

  • Figure 1: Landscapes and top views for FC and Han models on the checkerboard dataset: (a) FCNet, and (b) HanNet.
  • Figure 2: Visualization of an MLP-layer with ReLU activation and a Han-layer with ABS activation. On the left, $W$ is $d \times d$ and $Wx$ requires $O(d^2)$ operations, while on the right $u$ is a nonzero $d$-vector and the multiplication with $x$ requires only $O(d)$ operations.
  • Figure 3: Landscape of $\|F_{FC}(x)\|_2$ and $\|F_{Han}(x)\|_2$ in one instance, where $d=50$.
  • Figure 4: The average root mean squared error (RMSE) over 5 instances. Blue line: FC-1 approximates itself; red line: Han-3 approximates FC-1; blue dashed line: FC-1 approximates Han-3; red dashed line: Han-3 approximates itself.
  • Figure 5: Checkerboard datasets. In the right figure, the dots represent the training set (25%).
  • ...and 7 more figures

Theorems & Definitions (8)

  • Definition 1
  • Lemma 1
  • Lemma 2
  • proof
  • Proposition 1
  • Proposition 2
  • proof
  • Proposition 3