Path-conditioned training: a principled way to rescale ReLU neural networks

Arthur Lebeurrier; Titouan Vayer; Rémi Gribonval

TL;DR

This work introduces a geometrically motivated criterion for rescaling neural network parameters, whose minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference, and derives an efficient algorithm to perform this alignment.

Abstract

Despite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled sets of weights implement the same function, their training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion for rescaling neural network parameters, whose minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.
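The rescaling symmetry described in the abstract can be checked numerically on the toy model of Figure 1, $f_{\theta=(u,v,w)}(x) = u\operatorname{ReLU}(vx+w)$, with lifting $\Phi(\theta) = (uv, uw)^\top$. The sketch below is illustrative only: the helper names, the numerical values, and the squared loss are assumptions made for this example and are not taken from the paper. It shows that a rescaled parameter vector gives the same function and the same lifting, yet a different gradient, which is why gradient descent can behave differently from the two starting points.

```python
def relu(z):
    return max(z, 0.0)

def f(theta, x):
    """Toy model from Figure 1: f(x) = u * ReLU(v * x + w)."""
    u, v, w = theta
    return u * relu(v * x + w)

def rescale(theta, lam):
    """Neuron-wise rescaling (lam > 0): incoming weights times lam, outgoing weight divided by lam."""
    u, v, w = theta
    return (u / lam, lam * v, lam * w)

def path_lifting(theta):
    """Lifting of the toy model: Phi(theta) = (u*v, u*w)."""
    u, v, w = theta
    return (u * v, u * w)

def grad_squared_loss(theta, x, y):
    """Gradient of 0.5 * (f(x) - y)^2 with respect to (u, v, w).
    The squared loss is an assumption chosen for illustration."""
    u, v, w = theta
    pre = v * x + w
    active = 1.0 if pre > 0 else 0.0
    err = u * relu(pre) - y
    return (err * relu(pre), err * u * active * x, err * u * active)

theta = (2.0, 0.5, 1.0)           # hypothetical initialization
theta_resc = rescale(theta, 4.0)  # equivalent parameters, lam = 4

xs = [-3.0, -1.0, 0.0, 1.0, 2.5]
same_function = all(abs(f(theta, x) - f(theta_resc, x)) < 1e-12 for x in xs)
same_lifting = all(
    abs(a - b) < 1e-12
    for a, b in zip(path_lifting(theta), path_lifting(theta_resc))
)

g = grad_squared_loss(theta, 1.0, 0.0)
g_resc = grad_squared_loss(theta_resc, 1.0, 0.0)

print(same_function, same_lifting)  # both True
print(g, g_resc)                    # the two gradients differ
```

Both parameter vectors lie on the same orbit of the rescaling symmetry, so choosing one of them before training changes the optimization path without changing the function being trained.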

Paper Structure (45 sections, 16 theorems, 100 equations, 8 figures, 1 table, 3 algorithms)


Introduction
Sketch of the idea on a simple example
Rescaling symmetries and the path-lifting
Path-conditioned training
Rescaling is pre-conditioning
Proposed rescaling criterion
Explicit algorithm with the $\operatorname{logdet}$ divergence
Computational complexity.
Experiments
Faster Training Dynamics
Generalization
What are the Favorable Regimes for PathCond?
Conclusion
Properties of the path-lifting
Gradient in $\Phi$-space
...and 30 more sections

Key Result

Lemma 4.1

If $g > 0$, for any neuron $h$ the problem $\min_{u_h \in {\mathbb{R}}} \ F(u_1, \cdots, u_h, \cdots, u_H)$ has a solution given by $\log(r_h)$ where $r_h$ is the unique positive root of the polynomial $\mathcal{B}(\mathcal{A} + p) X^2 + \mathcal{A} \mathcal{D} X + \mathcal{C}(\mathcal{A}-p)$ where,

Figures (8)

Figure 1: GD for a toy model $f_{\theta=(u,v,w)}(x) = u\operatorname{ReLU}(vx+w)$ on a loss $L(\theta)$ that can be factorized as $L(\theta) = \ell(\Phi(\theta))$ (see the section "Sketch of the idea on a simple example"). (Left) Loss $L(\theta)$ during GD iterations for three different initializations $\theta_0$ (three colors). Dashed lines correspond to GD starting at $\theta_0$, bold lines to GD starting at rescaled $\theta_0^{(\lambda)} \sim \theta_0$ using PathCond. (Middle) Trajectories in lifted space $\Phi(\theta) = (uv, uw)^\top$. Dotted lines are trajectories corresponding to $\partial_t \Phi = - \nabla_\Phi \ell(\Phi)$ (GD on $\ell(\Phi)$). (Right) Trajectories in parameter space.
Figure 2: PathCond performance comparison across network depths on CIFAR-10 with multilayer perceptrons. (Left) Number of epochs required to reach $99\%$ training accuracy for networks with $2$ to $8$ hidden layers (abscissa = number of parameters, which increases with depth). (Middle) Training accuracy curves for the $3$-hidden-layer network. (Right) Corresponding training loss curves.
Figure 3: Training dynamics on CIFAR-10 with fully convolutional architecture (CIFAR-NV). (Left) Training loss, (Middle) training accuracy, and (Right) test accuracy.
Figure 4: Analysis of the relationship between architectural balance (controlled by the maximum width ratio $\max\frac{n_i}{n_j}$) and log-rescaling magnitude for small and large variance regimes.
Figure 5: Effect of compression on MNIST autoencoder training. (Top) Final training loss for different compression factors. (Bottom) Maximum absolute value of the log rescaling at initialization.
...and 3 more figures

Theorems & Definitions (36)

Lemma 4.1
Proposition 5.1: Expected diagonal under standard initialization
Lemma 1.1
proof
Definition 4.1: Neuron-wise rescaling symmetry
Definition 4.2: Rescaling symmetry group
Lemma 4.3
proof
Definition 4.4: Path in a DAG
Definition 4.5: Path lifting
...and 26 more
