It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
Hannah Pinson
TL;DR
This work asks why gradient descent compresses an overparameterized network to an effective capacity that fits the task, and answers by dissecting neuron-level learning dynamics in a single-hidden-layer ReLU network trained for binary classification. It introduces and analyzes three dynamical principles (mutual alignment, unlocking, and racing) that together explain why training leaves equivalent neurons that can be merged and low-norm weights that can be pruned, so that the lottery-ticket effect emerges from a race among neurons rather than from a static lottery. The paper derives gradient equations under gating, characterizes fixed points for weight directions, and shows that norm growth is exponentially amplified for neurons closer to their target directions, producing early winners that dominate learning while the rest can be pruned. Experiments on binary classification tasks derived from CIFAR-10 support the theory: early angular distances predict final weight norms, and small initialization leads to substantial neuron merging. The results have implications for understanding capacity control and sparsity in larger networks.
Abstract
Our theoretical understanding of neural networks lags behind their empirical success. One important unexplained phenomenon is why and how, during training with gradient descent, the theoretical capacity of a neural network is reduced to an effective capacity that fits the task. Here we investigate the mechanism by which gradient descent achieves this, by analyzing the learning dynamics at the level of individual neurons in single-hidden-layer ReLU networks. We identify three dynamical principles (mutual alignment, unlocking, and racing) that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low-norm weights. In particular, we explain the mechanism behind the lottery ticket conjecture: why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
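To make the two post-training reductions named in the abstract concrete, the following is a minimal sketch in plain Python, not the paper's procedure. It assumes a bias-free single-hidden-layer ReLU network with a scalar output, f(x) = sum_i a_i * relu(w_i . x). The function names (`merge_aligned`, `prune_low_norm`), the cosine tolerance, and the pruning score |a_i| * ||w_i|| are all illustrative assumptions. Merging relies only on the identity relu(c z) = c relu(z) for c > 0, so neurons whose input weights point in the same direction can be combined without changing the network output.

```python
import math

def relu(z):
    return max(0.0, z)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def forward(W, a, x):
    """Bias-free single-hidden-layer ReLU network: sum_i a_i * relu(w_i . x)."""
    return sum(a_i * relu(sum(w * xi for w, xi in zip(w_i, x)))
               for w_i, a_i in zip(W, a))

def merge_aligned(W, a, tol=1e-9):
    """Merge hidden neurons whose input weight vectors share a direction.

    `tol` (hypothetical) is the allowed gap between cosine similarity and 1.
    """
    merged_W, merged_a = [], []
    for w_i, a_i in zip(W, a):
        n_i = norm(w_i)
        if n_i == 0.0:
            continue  # a zero-weight neuron never activates without a bias
        for k, w_k in enumerate(merged_W):
            n_k = norm(w_k)
            cos = sum(p * q for p, q in zip(w_i, w_k)) / (n_i * n_k)
            if cos > 1.0 - tol:
                # w_i = (n_i / n_k) * w_k, and relu(c z) = c relu(z) for c > 0,
                # so the scale folds into the output weight of neuron k.
                merged_a[k] += a_i * n_i / n_k
                break
        else:
            merged_W.append(list(w_i))
            merged_a.append(a_i)
    return merged_W, merged_a

def prune_low_norm(W, a, threshold):
    """Drop neurons whose score |a_i| * ||w_i|| is below `threshold` (assumed score)."""
    kept = [(w_i, a_i) for w_i, a_i in zip(W, a)
            if abs(a_i) * norm(w_i) >= threshold]
    return [w for w, _ in kept], [s for _, s in kept]

# Toy weights: neuron 1 is a positive multiple of neuron 0; neuron 3 is tiny.
W_demo = [[1.0, 2.0], [2.0, 4.0], [-1.0, 0.5], [0.001, 0.0]]
a_demo = [0.5, -0.1, 1.0, 0.002]
W_merged, a_merged = merge_aligned(W_demo, a_demo)
W_small, a_small = prune_low_norm(W_merged, a_merged, threshold=1e-3)
print(len(W_demo), len(W_merged), len(W_small))
```

In this toy example merging is exact up to floating-point error, whereas pruning changes the output by at most the contribution of the dropped neurons, so the threshold trades size against accuracy.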
