
Path-conditioned training: a principled way to rescale ReLU neural networks

Arthur Lebeurrier, Titouan Vayer, Rémi Gribonval

TL;DR

This work introduces a geometrically motivated criterion for rescaling the parameters of ReLU neural networks. Minimizing this criterion yields a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference, and the authors derive an efficient algorithm to perform this alignment.

Abstract

Despite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled sets of weights implement the same function, their training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion to rescale neural network parameters, whose minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.
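The rescaling symmetry mentioned in the abstract can be checked numerically. The sketch below is illustrative only: it assumes a one-hidden-layer ReLU network with scalar input and output, and the names `forward`, `rescale`, and `lambdas` are hypothetical, not taken from the paper. Multiplying the incoming weight and bias of each hidden neuron by a positive factor and dividing its outgoing weight by the same factor should leave the network function unchanged, up to floating-point error.

```python
import random

def relu(z):
    return max(z, 0.0)

def forward(w_in, b_in, w_out, x):
    """One-hidden-layer ReLU network with scalar input and scalar output."""
    return sum(w_out[h] * relu(w_in[h] * x + b_in[h]) for h in range(len(w_out)))

def rescale(w_in, b_in, w_out, lambdas):
    """Neuron-wise rescaling: scale incoming weight and bias of neuron h by
    lambdas[h] > 0 and divide its outgoing weight by the same factor."""
    w_in_r = [lam * w for lam, w in zip(lambdas, w_in)]
    b_in_r = [lam * b for lam, b in zip(lambdas, b_in)]
    w_out_r = [w / lam for lam, w in zip(lambdas, w_out)]
    return w_in_r, b_in_r, w_out_r

rng = random.Random(0)
H = 4  # number of hidden neurons (arbitrary choice for the illustration)
w_in = [rng.gauss(0.0, 1.0) for _ in range(H)]
b_in = [rng.gauss(0.0, 1.0) for _ in range(H)]
w_out = [rng.gauss(0.0, 1.0) for _ in range(H)]

lambdas = [0.1, 2.0, 7.5, 1.0]  # arbitrary positive factors, not PathCond output
rescaled = rescale(w_in, b_in, w_out, lambdas)

xs = [-2.0, -0.5, 0.0, 0.3, 1.7]
max_gap = max(abs(forward(w_in, b_in, w_out, x) - forward(*rescaled, x)) for x in xs)
print("largest output difference:", max_gap)
```

The parameters differ but the outputs agree because $\operatorname{ReLU}(\lambda z) = \lambda \operatorname{ReLU}(z)$ for $\lambda > 0$. This only illustrates the symmetry; it does not choose the factors in any principled way.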

Paper Structure

This paper contains 45 sections, 16 theorems, 100 equations, 8 figures, 1 table, 3 algorithms.

Key Result

Lemma 4.1

If $g > 0$, then for any neuron $h$ the problem $\min_{u_h \in \mathbb{R}} F(u_1, \dots, u_h, \dots, u_H)$ has a solution given by $\log(r_h)$, with $r_h$ the unique positive root of the polynomial $\mathcal{B}(\mathcal{A} + p) X^2 + \mathcal{A} \mathcal{D} X + \mathcal{C}(\mathcal{A}-p)$ where,
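The root-finding step in Lemma 4.1 can be sketched as follows. This excerpt does not define $\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D}$ or $p$, so the code takes three generic coefficients `a`, `b`, `c`, standing for $\mathcal{B}(\mathcal{A}+p)$, $\mathcal{A}\mathcal{D}$, and $\mathcal{C}(\mathcal{A}-p)$. It assumes `a > 0` and `c < 0`, which is one sufficient condition for a quadratic to have exactly one positive root. That sign condition is an assumption of this sketch, not a statement from the paper, and the function names are hypothetical.

```python
import math

def positive_root(a, b, c):
    """Unique positive root of a*X^2 + b*X + c, assuming a > 0 and c < 0.

    Under that assumption the discriminant is strictly positive and the two
    real roots have opposite signs. The branch on the sign of b avoids
    subtracting two nearly equal numbers.
    """
    if not (a > 0 and c < 0):
        raise ValueError("this sketch assumes a > 0 and c < 0")
    sqrt_disc = math.sqrt(b * b - 4.0 * a * c)
    if b >= 0:
        q = -0.5 * (b + sqrt_disc)  # q < 0, so c / q > 0
        return c / q
    q = 0.5 * (sqrt_disc - b)       # q > 0, so q / a > 0
    return q / a

def log_rescaling(a, b, c):
    """Coordinate update u_h = log(r_h), with r_h the positive root."""
    return math.log(positive_root(a, b, c))

# Example: 2*X^2 + 3*X - 2 = (2X - 1)(X + 2) has positive root 1/2.
r = positive_root(2.0, 3.0, -2.0)
u = log_rescaling(2.0, 3.0, -2.0)
print(r, u)
```

Because each coordinate has a closed-form minimizer, such an update could be applied neuron by neuron in a coordinate-descent loop. Whether the paper's algorithm proceeds exactly this way is not stated in this excerpt.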

Figures (8)

  • Figure 1: GD for a toy model $f_{\theta=(u,v,w)}(x) = u\operatorname{ReLU}(vx+w)$ on a loss $L(\theta)$ that can be factorized as $L(\theta) = \ell(\Phi(\theta))$ (see \ref{sec:sketch_simple_example}). (Left) Loss $L(\theta)$ during GD iterations for three different initializations $\theta_0$ (three colors). Dashed lines correspond to GD starting at $\theta_0$, bold lines to GD starting at rescaled $\theta_0^{(\lambda)} \sim \theta_0$ using PathCond. (Middle) Trajectories in lifted space $\Phi(\theta) = (uv, uw)^\top$. Dotted lines are trajectories corresponding to $\partial_t \Phi = - \nabla_\Phi \ell(\Phi)$ (GD on $\ell(\Phi)$). (Right) Trajectories in parameter space.
  • Figure 2: PathCond performance comparison across network depths on CIFAR-10 with multilayer perceptrons. (Left) Number of epochs required to reach $99\%$ training accuracy for networks with $2$ to $8$ hidden layers (abscissa = number of parameters, which increases with depth). (Middle) Training accuracy curves for the $3$-hidden-layer network. (Right) Corresponding training loss curves.
  • Figure 3: Training dynamics on CIFAR-10 with fully convolutional architecture (CIFAR-NV). (Left) Training loss, (Middle) training accuracy, and (Right) test accuracy.
  • Figure 4: Analysis of the relationship between architectural balance (controlled by the maximum width ratio $\max\frac{n_i}{n_j}$) and log-rescaling magnitude for small and large variance regimes.
  • Figure 5: Effect of compression on MNIST autoencoder training. (Top) Final training loss for different compression factors. (Bottom) Maximum absolute value of the log rescaling at initialization.
  • ...and 3 more figures
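The toy model of Figure 1 can be reproduced in a few lines to see why rescaling matters for gradient descent. The sketch below uses $f_\theta(x) = u\operatorname{ReLU}(vx+w)$ and the lifting $\Phi(\theta) = (uv, uw)^\top$ from the caption. The quadratic loss $\ell(\Phi) = \tfrac12\|\Phi - \Phi^\star\|^2$, the target `PHI_STAR`, the step size, and the factor `lam = 4.0` are all assumptions made for illustration; in particular, the factor is chosen by hand and is not the output of PathCond.

```python
def relu(z):
    return max(z, 0.0)

def f(theta, x):
    u, v, w = theta
    return u * relu(v * x + w)

def lift(theta):
    """Path lifting of the toy model: Phi(theta) = (u*v, u*w)."""
    u, v, w = theta
    return (u * v, u * w)

def rescale(theta, lam):
    """Rescaled parameters (u/lam, lam*v, lam*w) for lam > 0."""
    u, v, w = theta
    return (u / lam, lam * v, lam * w)

PHI_STAR = (1.0, -0.5)  # hypothetical target in lifted space

def loss(theta):
    p = lift(theta)
    return 0.5 * ((p[0] - PHI_STAR[0]) ** 2 + (p[1] - PHI_STAR[1]) ** 2)

def grad(theta):
    """Gradient of loss with respect to (u, v, w), via the chain rule."""
    u, v, w = theta
    e0 = u * v - PHI_STAR[0]
    e1 = u * w - PHI_STAR[1]
    return (e0 * v + e1 * w, e0 * u, e1 * u)

def gd_step(theta, lr):
    return tuple(t - lr * g for t, g in zip(theta, grad(theta)))

theta0 = (2.0, 0.3, 0.1)
theta0_rescaled = rescale(theta0, 4.0)

# Same function and same lifted point before training...
same_output = abs(f(theta0, 1.5) - f(theta0_rescaled, 1.5))
# ...but one GD step with the same step size gives different losses.
loss_plain = loss(gd_step(theta0, 0.1))
loss_rescaled = loss(gd_step(theta0_rescaled, 0.1))
print(same_output, loss_plain, loss_rescaled)
```

Both initializations share the same loss value and the same point in lifted space, yet the first gradient step moves them to different losses. Which initialization trains faster depends on the loss and on the chosen factor; selecting that factor in a principled way is the problem the paper addresses.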

Theorems & Definitions (36)

  • Lemma 4.1
  • Proposition 5.1: Expected diagonal under standard initialization
  • Lemma 1.1
  • proof
  • Definition 4.1: Neuron-wise rescaling symmetry
  • Definition 4.2: Rescaling symmetry group
  • Lemma 4.3
  • proof
  • Definition 4.4: Path in a DAG
  • Definition 4.5: Path lifting
  • ...and 26 more