It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task

Hannah Pinson

TL;DR

This work asks why gradient descent compresses overparameterized networks to an effective capacity that fits the task, and answers by dissecting neuron-level learning dynamics in a single-hidden-layer ReLU network trained for binary classification. It introduces and analyzes three dynamical principles (mutual alignment, unlocking, and racing) that together explain why equivalent neurons can be merged and low-norm weights pruned after training, recasting the lottery-ticket effect as an emergent race among neurons rather than a static lottery. The paper derives gradient equations under a fixed gating pattern, characterizes fixed points for the weight directions, and shows that norm growth is exponentially amplified for neurons closer to their target directions, so that early winners dominate learning while the remaining neurons stay small enough to prune. Experiments on binary classification data derived from CIFAR-10 support the theory: a neuron's angular distance to its target direction early in training predicts its final norm, and substantial neuron merging is possible under small initialization. The results have implications for understanding capacity control and sparsity in larger networks.

Abstract

Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during the process of training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. We here investigate the mechanism by which gradient descent achieves this through analyzing the learning dynamics at the level of individual neurons in single hidden layer ReLU networks. We identify three dynamical principles -- mutual alignment, unlocking and racing -- that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, or why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
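The two capacity reductions named in the abstract, merging equivalent neurons and pruning low-norm weights, can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes a bias-free network with a scalar output, nonzero weight vectors, and hypothetical helper names (`merge_aligned`, `prune_low_norm`) and thresholds chosen only for illustration. Merging relies on the positive homogeneity of ReLU: if two incoming weight vectors point in the same direction, their outputs differ only by a positive scale that can be folded into the outgoing weight.

```python
import math

def relu(z):
    return max(0.0, z)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def cosine(u, v):
    # Assumes both vectors are nonzero.
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def forward(w_in, w_out, x):
    """Bias-free single-hidden-layer ReLU network with scalar output (assumption)."""
    return sum(a * relu(sum(wi * xi for wi, xi in zip(w, x)))
               for w, a in zip(w_in, w_out))

def merge_aligned(w_in, w_out, tol=1e-9):
    """Merge neurons whose incoming weight vectors share a direction."""
    merged_in, merged_out = [], []
    for w, a in zip(w_in, w_out):
        for k, u in enumerate(merged_in):
            if cosine(w, u) > 1.0 - tol:
                # relu(c * z) = c * relu(z) for c > 0, so the scale
                # c = |w| / |u| moves into the outgoing weight.
                merged_out[k] += a * norm(w) / norm(u)
                break
        else:
            merged_in.append(list(w))
            merged_out.append(a)
    return merged_in, merged_out

def prune_low_norm(w_in, w_out, threshold):
    """Drop neurons whose total norm |w_out| * |w_in| is below threshold."""
    kept = [(w, a) for w, a in zip(w_in, w_out) if abs(a) * norm(w) >= threshold]
    return [w for w, _ in kept], [a for _, a in kept]

# Toy weights (made up): neurons 0 and 1 are parallel, neuron 3 is tiny.
w_in = [[1.0, 2.0], [2.0, 4.0], [-1.0, 0.5], [1e-4, -1e-4]]
w_out = [0.5, -0.1, 1.0, 1e-3]

merged_in, merged_out = merge_aligned(w_in, w_out)
pruned_in, pruned_out = prune_low_norm(merged_in, merged_out, threshold=1e-3)
print(len(w_in), len(merged_in), len(pruned_in))
```

Merging leaves the network function unchanged up to floating-point error, whereas pruning changes it by an amount bounded by the total norm of the removed neurons times the input norm. Exactly parallel neurons are an idealization, so a looser `tol` would make merging approximate as well.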

Paper Structure

This paper contains 27 sections, 6 theorems, 60 equations, and 9 figures.

Key Result

Theorem 4.1

For a fixed gating pattern $\vec{g}_\alpha$, the target direction for the outgoing weight vector $\phi_{\alpha}^*$ is given by:

Figures (9)

  • Figure 1: Experimental results for a binary classification task based on the CIFAR-10 dataset. a) Evolution of the training loss. b) Cosine similarity matrices between the total parameter vectors of individual neurons at different timesteps during training. Over time, individual neurons start to (mutually) align to shared target directions. (Neuron ordering differs from matrix to matrix.) c) Illustration of how well the angular distance $\delta_\alpha$ of the incoming weight vector of a neuron $\alpha$ to its target direction early in training predicts the neuron's final total norm, denoted $a_{\alpha} = \|\vec{w}^{(2)}_{\alpha}\| \, \|\vec{w}^{(1)}_{\alpha}\|$. The upper row shows the results at initialization (t=0), the lower row at a timepoint early in training (t=500), before the loss drops. In the left column, we plot the current value of $a_\alpha$ as a function of $\cos(\delta_\alpha)$. In the right column, we plot the final value of $a_\alpha$ as a function of the same values of $\cos(\delta_\alpha)$, i.e., those at t=0 and t=500. In both columns, neurons have a darker color if they have a higher final norm. This shows that we can, to a large degree, predict which neurons will obtain a higher final norm by looking at the angular distance to their target direction at initialization (upper row). We argue that this arises because, in the early phase of learning, neurons that arrive earlier at their target direction (because they were oriented favorably to start with) grow exponentially faster in norm. This can be seen in the lower row, and is explained in our theoretical results.
  • Figure 2: Visualization of alignment for a simple XOR-like dataset (illustrated in fig. \ref{app_fig:data_xor}), but where the centroids of the 4 different clusters have 4 different norms. a) Training loss, showing 4 drops in loss. b) Visualization of the directions of the vectors of incoming weights (upper row) and vectors of outgoing weights (lower row) of each of the neurons during training. Vectors are colored according to the final direction they align to. The alignment process takes place at 4 different speeds (light blue is the fastest, red the slowest), corresponding to 4 different subsystems of neurons. For this simple dataset, the vectors remain aligned after the initial drops in the loss.
  • Figure 3: Ratio $\frac{\partial L^s}{\partial \phi_{\alpha}} / \frac{\partial L^s}{\partial \|\vec{w}^{(2)}_{\alpha}\|}$ as a function of $\frac{3\pi}{4} - \phi_{\alpha}$ for the XOR-like experiment with 1000 hidden neurons at t=0. Note the values on the y-axis.
  • Figure 4: Illustration of the alignment of incoming weight vectors during training for different initialization scales.
  • Figure 5: Illustration of the two larger subsystems of neurons evolving at different speeds. For the neurons that align to $\phi_{\alpha}^{2*}$ we used the negative of their $\cos(\delta_\alpha)$ values, so that the two groups can be compared as the left- and right-hand sides of the same plot.
  • ...and 4 more figures
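
The quantities plotted in Figure 1 can be computed from a network's weights along the following lines. This is a rough sketch rather than the paper's code: it assumes that a neuron's "total parameter vector" is the concatenation of its incoming and outgoing weight vectors, that all vectors are nonzero, and it uses made-up toy weights and hypothetical function names.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between (nonzero) vectors."""
    unit = [[c / norm(v) for c in v] for v in vectors]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def total_norms(w_in, w_out):
    """Total norm a_alpha = |w_out_alpha| * |w_in_alpha| for every neuron."""
    return [norm(w1) * norm(w2) for w1, w2 in zip(w_in, w_out)]

# Toy weights (made up): one row per hidden neuron.
w_in = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]      # incoming weight vectors
w_out = [[1.0, -1.0], [2.0, -2.0], [0.5, 0.5]]   # outgoing weight vectors

# Assumed definition: total parameter vector = incoming + outgoing weights.
total = [w1 + w2 for w1, w2 in zip(w_in, w_out)]

sim = cosine_similarity_matrix(total)
a = total_norms(w_in, w_out)

for row in sim:
    print([round(s, 3) for s in row])
print([round(v, 3) for v in a])
```

In this toy example the first two neurons are rescaled copies of each other, so their similarity entry is 1 while their total norms differ. That is the pattern Figure 1 describes for neurons that align to a shared target direction but end up with different norms.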

Theorems & Definitions (6)

  • Theorem 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Proposition 4.4
  • Proposition 4.5
  • Proposition 4.6