Topological Invariance and Breakdown in Learning
Yongyi Yang, Tomaso Poggio, Isaac Chuang, Liu Ziyin
TL;DR
This work analyzes learning dynamics under permutation-equivariant updates and proves a universal topological constraint on neuron configurations: for small steps, each update induces a bi-Lipschitz (hence topology-preserving) mapping between neurons, while large steps permit topological simplifications that reduce expressivity. The theory identifies a topological critical point $η^* = 1/K$ that separates the topology-preserving and topology-changing phases, and connects these phases to the edge-of-stability phenomenon observed in practice. The results do not depend on specific architectures or losses: below the critical rate, permutation symmetry alone constrains the topology of neuron distributions and densities through measure-preserving updates. The paper further shows that the theory applies to gradient descent and Adam, provides experimental illustrations using Betti numbers on low-dimensional neuron manifolds and on real tasks, and discusses implications for learning-rate schedules and for deep-learning theory beyond the traditional NTK and mean-field pictures.
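A short worked inequality may help show why a threshold of the form $1/K$ can arise. It is a sketch under assumptions that the summary above does not state: the update of a single neuron is written as $T_η(θ) = θ - η\,g(θ)$, the symbol $g$ is a hypothetical per-neuron update direction, and $K$ is taken to be a Lipschitz constant of $g$.

```latex
\begin{align*}
T_\eta(\theta) &= \theta - \eta\, g(\theta),
  \qquad \|g(\theta) - g(\theta')\| \le K\,\|\theta - \theta'\|, \\
\|T_\eta(\theta) - T_\eta(\theta')\|
  &\ge \|\theta - \theta'\| - \eta\,\|g(\theta) - g(\theta')\|
  \ge (1 - \eta K)\,\|\theta - \theta'\|, \\
\|T_\eta(\theta) - T_\eta(\theta')\|
  &\le \|\theta - \theta'\| + \eta\,\|g(\theta) - g(\theta')\|
  \le (1 + \eta K)\,\|\theta - \theta'\|.
\end{align*}
```

Under these assumptions the lower constant $1 - ηK$ is positive exactly when $η < 1/K$, so distinct neurons remain distinct and $T_η$ is bi-Lipschitz onto its image. Once $η \ge 1/K$ the lower bound no longer excludes two neurons being mapped to the same point.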
Abstract
We prove that for a broad class of permutation-equivariant learning rules (including SGD, Adam, and others), the training process induces a bi-Lipschitz mapping between neurons and strongly constrains the topology of the neuron distribution during training. This result reveals a qualitative difference between small and large learning rates $η$. With a learning rate below a topological critical point $η^*$, the training is constrained to preserve all topological structure of the neurons. In contrast, above $η^*$, the learning process allows for topological simplification, making the neuron manifold progressively coarser and thereby reducing the model's expressivity. Viewed in combination with the recent discovery of the edge-of-stability phenomenon, the learning dynamics of neural networks under gradient descent can be divided into two phases: the networks first undergo smooth optimization under topological constraints, and then enter a second phase in which they learn through drastic topological simplifications. A key feature of our theory is that it is independent of specific architectures or loss functions, enabling the universal application of topological methods to the study of deep learning.
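The contrast between the two regimes can be illustrated numerically with a toy example. The sketch below is not the paper's experiment: it assumes one-dimensional neurons, a hypothetical quadratic loss whose per-neuron gradient is `K * theta`, and it treats `K` as the Lipschitz constant of that gradient. It compares pairwise distances before and after one gradient-descent step for a step size below `1/K` and for a step size equal to `1/K`.

```python
import numpy as np

K = 2.0  # assumed Lipschitz constant of the toy per-neuron gradient


def grad(theta):
    # Gradient of the toy loss (K / 2) * theta**2, evaluated per neuron.
    return K * theta


def step(theta, eta):
    # One gradient-descent step; the same rule is applied to every neuron,
    # so permuting the neurons permutes the output in the same way.
    return theta - eta * grad(theta)


def distance_ratios(before, after):
    # Ratio |T(x) - T(y)| / |x - y| over all distinct pairs of neurons.
    i, j = np.triu_indices(len(before), k=1)
    return np.abs(after[i] - after[j]) / np.abs(before[i] - before[j])


theta = np.linspace(-1.0, 1.0, 9)  # nine distinct neurons

# Below the threshold: every pairwise distance shrinks by the factor 1 - eta*K.
eta_small = 0.5 / K
small = step(theta, eta_small)
ratios = distance_ratios(theta, small)

# At the threshold: the lower bound 1 - eta*K is zero and all neurons merge.
eta_crit = 1.0 / K
collapsed = step(theta, eta_crit)

n_small = len(np.unique(small))          # distinct neurons after a small step
n_collapsed = len(np.unique(collapsed))  # distinct neurons after the critical step

print("distance ratios (small step):", ratios.min(), ratios.max())
print("distinct neurons:", n_small, "vs", n_collapsed)
```

In this toy the small step keeps all nine neurons distinct, with every distance ratio inside the interval from `1 - eta*K` to `1 + eta*K`, while the step at `1/K` sends every neuron to the same point. The linear gradient is chosen only because the collapse is exact; for other gradient fields a merge may occur only at larger step sizes or not at all, since the bound is a sufficient condition for preserving distinctness.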
