On the distance between two neural networks and the stability of learning

Jeremy Bernstein, Arash Vahdat, Yisong Yue, Ming-Yu Liu

TL;DR

This work addresses the persistent challenge of learning-rate tuning in deep networks by introducing a distance on neural networks, called deep relative trust, that captures how perturbations to layer parameters propagate through the network’s compositional structure. Building on this, the authors derive a neural-network–specific descent lemma and propose Frobenius matched gradient descent (Fromage), a per-layer update with a single interpretable hyperparameter governing relative perturbation size. Theoretical results reveal a quasi-exponential trust region with depth and a product-form bound on gradient breakdown, while empirical studies show Fromage performs robustly across MNIST, CIFAR-10, ImageNet, GANs, and transformers with minimal learning-rate tuning. Collectively, the paper offers a principled framework to stabilize training of deep networks and simplify optimization workflows, with open-source code provided for replication.

Abstract

This paper relates parameter distance to gradient breakdown for a broad class of nonlinear compositional functions. The analysis leads to a new distance function called deep relative trust and a descent lemma for neural networks. Since the resulting learning rule seems to require little to no learning rate tuning, it may unlock a simpler workflow for training deeper and more complex neural networks. The Python code used in this paper is here: https://github.com/jxbz/fromage.

Paper Structure

This paper contains 11 sections, 8 theorems, 47 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

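Consider a multilayer perceptron with $L$ layers. The layer-$l$-to-output Jacobian $J_l$ is given by:

$$J_l := \frac{\partial f(x)}{\partial h_l} = \Phi^\prime_L W_L \cdot \Phi^\prime_{L-1} W_{L-1} \cdots \Phi^\prime_{l+1} W_{l+1},$$

where $\Phi^\prime_k:=\mathrm{diag}[\varphi^\prime(W_k\, h_{k-1}(x))]$ denotes the derivative of the nonlinearity at the $k$th layer.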
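The product structure of the Jacobian in Proposition 1 can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes each layer computes $h_k = \varphi(W_k\, h_{k-1})$, uses $\tanh$ as a stand-in nonlinearity, and all function names and weight values are hypothetical. It builds $J_l$ layer by layer as a product of $\Phi^\prime_k W_k$ factors and compares the result with central finite differences.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def phi(z):
    # Assumed nonlinearity for illustration only.
    return math.tanh(z)

def dphi(z):
    return 1.0 - math.tanh(z) ** 2

def forward_from(weights, h, start):
    """Apply layers start+1..L to the activation h_start (weights[0] is W_1)."""
    for W in weights[start:]:
        h = [phi(z) for z in matvec(W, h)]
    return h

def jacobian_from(weights, h, start):
    """Accumulate J_l as the product of Phi'_k W_k for k = l+1..L, with l = start."""
    n = len(h)
    J = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for W in weights[start:]:
        pre = matvec(W, h)
        # Phi'_k W_k: row i of W_k scaled by phi'(pre_i).
        layer_jac = [[dphi(p) * w for w in row] for p, row in zip(pre, W)]
        J = matmul(layer_jac, J)  # newer layers multiply on the left
        h = [phi(p) for p in pre]
    return J

def finite_difference_jacobian(weights, h, start, eps=1e-6):
    """Central-difference estimate of d f / d h_start, one column per input."""
    cols = []
    for j in range(len(h)):
        hp, hm = list(h), list(h)
        hp[j] += eps
        hm[j] -= eps
        fp = forward_from(weights, hp, start)
        fm = forward_from(weights, hm, start)
        cols.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    return [list(row) for row in zip(*cols)]

# Hypothetical 3-layer perceptron with widths 2 -> 3 -> 2 -> 1.
weights = [
    [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]],  # W_1
    [[0.4, -0.7, 0.3], [0.9, 0.2, -0.5]],    # W_2
    [[0.3, -0.2]],                           # W_3
]
x = [0.7, -0.4]
h1 = forward_from(weights[:1], x, 0)               # first hidden activation h_1
J1 = jacobian_from(weights, h1, 1)                 # layer-1-to-output Jacobian
J1_fd = finite_difference_jacobian(weights, h1, 1)
```

Here `J1` and `J1_fd` should agree up to finite-difference error, which illustrates why a perturbation to one layer is scaled by every downstream factor $\Phi^\prime_k W_k$.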

Figures (4)

  • Figure 1: Using Fromage, we train a 2-layer (left) and 16-layer (right) perceptron to classify the MNIST dataset. With the network frozen at ten different training checkpoints, we first compute the gradient of the $l$th layer $g_l$ using the full data batch. We then record the loss and full batch gradient $\widetilde{g}_l$ after perturbing all weight matrices $W_l$ ($l=1,...,L$) to $W_l -\eta \cdot g_l \frac{\Vert {W_l} \Vert_F}{\Vert {g_l} \Vert_F}$ for various perturbation strengths $\eta$. We plot the classification loss and the relative change in gradient of the input layer $\Vert {\widetilde{g}_1 - g_1} \Vert_F/\Vert {g_1} \Vert_F$ along these parameter slices, all on a log scale. We find that the loss and relative change in gradient grow quasi-exponentially when the perceptron is deep, suggesting that Euclidean trust is violated. As such, these results are more consistent with our notion of deep relative trust.
  • Figure 2: Fromage: a good default is $\eta=0.01$.
  • Figure 3: Training multilayer perceptrons at depths challenging for existing optimisers. We train multilayer perceptrons of depth $L$ on the MNIST dataset. At each depth, we plot the training accuracy after 100 epochs. Left: for each algorithm, we plot the best performing run over 3 learning rate settings found to be appropriate for that algorithm. We also plot trend lines to help guide the eye. Right: the Fromage results are presented for each learning rate setting. Since for deeper networks a smaller value of $\eta$ was needed in Fromage, these results provide partial support for Lemma \ref{thm:relative}.
  • Figure 4: Learning rate tuning for standard benchmarks. For each learning rate setting $\eta$, we plot the error at the best tuned $\eta$ divided by the error for that $\eta$, so that a value of 1.0 corresponds to the best learning rate setting for that task. For Fromage, the setting of $\eta=0.01$ was optimal across all tasks.
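The parameter slice in the Figure 1 caption, $W_l -\eta \cdot g_l \frac{\Vert W_l \Vert_F}{\Vert g_l \Vert_F}$, and the relative change in gradient can be sketched in plain Python as follows. This is an illustrative sketch only: the function names and the example matrices are hypothetical, it covers just the perturbation stated in the caption, and it does not reproduce any further details of the Fromage algorithm that this summary omits.

```python
import math

def frobenius_norm(M):
    return math.sqrt(sum(v * v for row in M for v in row))

def relative_perturbation(W, g, eta):
    """Return W - eta * g * ||W||_F / ||g||_F. Assumes g is not all zeros."""
    scale = eta * frobenius_norm(W) / frobenius_norm(g)
    return [[w - scale * gv for w, gv in zip(w_row, g_row)]
            for w_row, g_row in zip(W, g)]

def relative_change(new, old):
    """Return ||new - old||_F / ||old||_F, the ratio plotted for gradients in Figure 1."""
    diff = [[a - b for a, b in zip(r_new, r_old)] for r_new, r_old in zip(new, old)]
    return frobenius_norm(diff) / frobenius_norm(old)

# Hypothetical weight matrix and gradient, chosen so the norms are easy to check.
W = [[3.0, 0.0], [0.0, 4.0]]  # ||W||_F = 5
g = [[0.0, 1.0], [2.0, 2.0]]  # ||g||_F = 3
eta = 0.01

W_new = relative_perturbation(W, g, eta)
# Applying the same ratio to the weights gives the relative size of the step.
step_ratio = relative_change(W_new, W)
```

By construction the step satisfies $\Vert \Delta W_l \Vert_F / \Vert W_l \Vert_F = \eta$, so `step_ratio` matches `eta` regardless of the gradient's scale. This is the sense in which $\eta$ governs the relative perturbation size of each layer.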

Theorems & Definitions (14)

  • Definition 1: Multilayer perceptron
  • Proposition 1: Jacobian of the multilayer perceptron
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 2
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • ...and 4 more