Neural Characteristic Activation Analysis and Geometric Parameterization for ReLU Networks

Wenlin Chen, Hong Ge

TL;DR

This paper proposes Geometric Parameterization (GmP), a neural network parameterization technique that separates the radial and angular components of weights in the hyperspherical coordinate system, and shows theoretically that GmP resolves an instability that affects common parameterizations and normalizations during stochastic optimization.

Abstract

We introduce a novel approach for analyzing the training dynamics of ReLU networks by examining the characteristic activation boundaries of individual ReLU neurons. Our proposed analysis reveals a critical instability in common neural network parameterizations and normalizations during stochastic optimization, which impedes fast convergence and hurts generalization performance. Addressing this, we propose Geometric Parameterization (GmP), a novel neural network parameterization technique that effectively separates the radial and angular components of weights in the hyperspherical coordinate system. We show theoretically that GmP resolves the aforementioned instability issue. We report empirical results on various models and benchmarks to verify GmP's advantages of optimization stability, convergence speed and generalization performance.
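The separation of radial and angular components mentioned above can be illustrated with the standard hyperspherical coordinate map. The following is a minimal sketch, not the authors' implementation: the function names `unit_vector` and `weight_from_polar` are hypothetical, and the angle convention (cosine first, running product of sines) is an assumption chosen so that the two-dimensional case reduces to $(\cos\theta, \sin\theta)$.

```python
import math

def unit_vector(angles):
    """Map n-1 hyperspherical angles to a unit vector in R^n.

    Assumed convention: u_1 = cos(t1), u_2 = sin(t1)cos(t2), ...,
    u_n = sin(t1)...sin(t_{n-1}).
    """
    u = []
    sin_prod = 1.0  # running product sin(t1) * ... * sin(t_{k-1})
    for theta in angles:
        u.append(sin_prod * math.cos(theta))
        sin_prod *= math.sin(theta)
    u.append(sin_prod)
    return u

def weight_from_polar(r, angles):
    """Rebuild a weight vector from its radial part r and angular part."""
    return [r * c for c in unit_vector(angles)]

# Hypothetical example values: three angles give a direction in R^4.
angles = [0.3, 1.2, 2.5]
u = unit_vector(angles)
w = weight_from_polar(2.0, angles)

# The direction always has unit length, so the radius alone sets the norm.
norm_u = math.sqrt(sum(c * c for c in u))
norm_w = math.sqrt(sum(c * c for c in w))
```

Under this map the norm of the weight depends only on `r` and its direction only on `angles`, which is the sense in which the two components are separated.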

Paper Structure

This paper contains 22 sections, 4 theorems, 15 equations, 6 figures, 3 tables.

Key Result

Proposition 2.3

A perturbation $\boldsymbol{\varepsilon}$ to the weight $\mathbf{w}$ under SP can result in an arbitrarily large change in the angular direction of the CAB if $\mathbf{w}$ has a similar magnitude to …
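A small numeric experiment can make the proposition concrete. The sketch below assumes a two-dimensional unit of the usual form $z=\text{ReLU}(\mathbf{w}^{\text{T}}\mathbf{x}+b)$ under SP, so that the angular direction of the boundary is the angle of $\mathbf{w}$; this form is not spelled out in the text here. The perturbation $\delta\mathbf{1}$ follows the Figure 1 caption, while the specific weights and helper names are hypothetical.

```python
import math

def direction_angle(w):
    """Angle of a 2D weight vector, i.e. the direction of the boundary normal."""
    return math.atan2(w[1], w[0])

def angle_change(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def sp_angle_change(w, delta):
    """Change in direction when delta is added to every weight (SP, assumed form)."""
    w_pert = (w[0] + delta, w[1] + delta)
    return angle_change(direction_angle(w), direction_angle(w_pert))

delta = 0.01
small_w = (0.001, -0.009)  # hypothetical weight with magnitude comparable to delta
large_w = (1.0, -9.0)      # same direction, magnitude much larger than delta

change_small = sp_angle_change(small_w, delta)  # well over 1 radian
change_large = sp_angle_change(large_w, delta)  # on the order of delta or less

# If the angle itself is the trainable parameter, as in the 2D GmP unit of
# Figure 1, perturbing it by delta moves the direction by exactly delta.
theta = direction_angle(small_w)
change_gmp = angle_change(theta, theta + delta)
```

The same perturbation rotates the small weight by more than a radian but barely moves the large one, whereas the change under a direct angular parameter does not depend on the weight's magnitude.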

Figures (6)

  • Figure 1: (a) Characteristic activation boundary (CAB) $\mathcal{B}$ (brown solid line) and spatial location $\boldsymbol{\phi}=-\lambda\mathbf{u}(\theta)$ of a ReLU unit $z=\text{ReLU}(\mathbf{u}(\theta)^{\text{T}}\mathbf{x}+\lambda)=\text{ReLU}(\cos(\theta)x_1+\sin(\theta)x_2+\lambda)$ for inputs $\mathbf{x}\in\mathbb{R}^2$. The CAB forms a line in $\mathbb{R}^2$, which acts as a boundary separating inputs into two regions. Green arrows denote the active region, and red arrows denote the inactive region. (b)-(e) Stability of the CAB of a ReLU unit in $\mathbb{R}^2$ under small perturbations $\boldsymbol{\varepsilon}=\delta\mathbf{1}$ to the parameters. Solid lines denote characteristic activation boundaries $\mathcal{B}$, and colored dotted lines connect the origin and spatial locations $\boldsymbol{\phi}$ of $\mathcal{B}$. Smaller changes between the perturbed and original boundaries imply higher stability. GmP is most stable against perturbations.
  • Figure 2: (a)-(b) Characteristic activation point $\mathcal{B}$ (intersection of brown solid lines and the x-axis) and spatial location $\phi=-\lambda u(\theta)$ of a ReLU unit $z=\text{ReLU}(u(\theta)x+\lambda)$ (blue solid lines) for inputs $x\in\mathbb{R}$. Green arrows denote active regions, and red arrows denote inactive regions. (c) Evolution dynamics of the characteristic points $\mathcal{B}$ in a one-hidden-layer network with 100 ReLU units for a 1D Levy regression problem under SP, WN, BN and GmP during training. SP stands for standard parameterization, WN stands for weight normalization, BN stands for batch normalization, and GmP stands for geometric parameterization. Smaller values are better as they indicate higher stability of the evolution of the characteristic points during training. The y-axis is in $\log_2$ scale. (d)-(g) Top row: the experimental setup, including the network's predictions at initialization and after training, the training data, and the ground-truth function (Levy). Bottom row: the evolution of the characteristic activation point for the 100 ReLU units during training. Each horizontal bar shows the spatial location spectrum for a chosen optimization step, moving from the bottom (at initialization) to the top (after training with Adam). A wider spread of the spatial locations covers the data better and adds more useful non-linearities to the model, making predictions more accurate. Regression accuracy is measured by root mean squared error (RMSE) on a separate test set. Smaller RMSE values are better. We use cross-validation to select the learning rate for each method. The optimal learning rate for SP, WN, and BN is lower than that for GmP, since their training becomes unstable with higher learning rates, as shown in (c).
  • Figure 3: Performance of a single-hidden-layer neural network with 10 ReLU units on the 2D Banana classification dataset under SP, WN, BN and GmP trained using Adam. SP stands for standard parameterization, WN stands for weight normalization, BN stands for batch normalization, and GmP stands for geometric parameterization. (a)-(h): Trajectories of the spatial locations of the 10 ReLU units during training. Each color depicts one ReLU unit. Smoother evolution means higher training stability. The evolution under GmP is stable, so we can use a $10\times$ larger learning rate. (i): Evolution dynamics of the angular direction $\theta$ of CABs. Smaller values are better as they indicate higher robustness against stochastic gradient noise. (j)-(m): Network predictions after training. Black bold lines depict the classification boundary between two classes. Classification accuracy is measured on a separate test set. Higher accuracy values are better. The red stars show the spatial locations of 10 ReLU units. Intuitively speaking, more evenly spread out red stars are better for classification accuracy, as they provide more useful non-linearity.
  • Figure 4: Convergence speed for VGG-6 trained on the ImageNet32 dataset with batch size 1024.
  • Figure 5: Visualization of characteristic activation boundaries (brown solid lines) and spatial locations $\phi=-\lambda u(\theta)$ of a ReLU unit $z=\text{ReLU}(u(\theta)x+\lambda)$ (blue solid lines) for inputs $x\in\mathbb{R}$. Green arrows denote active regions and red arrows denote inactive regions.
  • ...and 1 more figures
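
The two-dimensional unit in the Figure 1 caption can be checked numerically. This is a minimal sketch that takes the CAB to be the set of inputs where the pre-activation $\mathbf{u}(\theta)^{\text{T}}\mathbf{x}+\lambda$ is zero, which matches the caption's description of a boundary between active and inactive regions; the formal Definition 2.1 is not reproduced here. The function names and the values of $\theta$ and $\lambda$ are hypothetical.

```python
import math

def preactivation(x, theta, lam):
    """u(theta)^T x + lambda for a 2D input x, with u = (cos theta, sin theta)."""
    return math.cos(theta) * x[0] + math.sin(theta) * x[1] + lam

def gmp_unit(x, theta, lam):
    """z = ReLU(cos(theta) x_1 + sin(theta) x_2 + lambda), as in Figure 1."""
    return max(0.0, preactivation(x, theta, lam))

def spatial_location(theta, lam):
    """phi = -lambda * u(theta), as in Figure 1."""
    return (-lam * math.cos(theta), -lam * math.sin(theta))

theta, lam = 0.6, -1.5  # hypothetical parameter values
u = (math.cos(theta), math.sin(theta))
phi = spatial_location(theta, lam)

# phi lies on the boundary: its pre-activation is zero up to rounding.
on_boundary = preactivation(phi, theta, lam)

# phi sits at distance |lambda| from the origin.
distance = math.hypot(phi[0], phi[1])

# Stepping from phi along u enters the active region; along -u, the inactive one.
active = gmp_unit((phi[0] + 0.5 * u[0], phi[1] + 0.5 * u[1]), theta, lam)
inactive = gmp_unit((phi[0] - 0.5 * u[0], phi[1] - 0.5 * u[1]), theta, lam)
```

In this form $\theta$ alone sets the orientation of the boundary and $\lambda$ alone sets its signed offset from the origin, so each parameter has a direct geometric reading.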

Theorems & Definitions (15)

  • Definition 2.1: CAB
  • Definition 2.2: Spatial location
  • Proposition 2.3: Instability of SP
  • proof
  • Proposition 2.4: Instability of WN
  • proof
  • Proposition 2.5: Instability of BN
  • proof
  • Definition 3.1
  • Definition 3.2
  • ...and 5 more