Topological Invariance and Breakdown in Learning

Yongyi Yang, Tomaso Poggio, Isaac Chuang, Liu Ziyin

TL;DR

This work analyzes learning dynamics under permutation-equivariant updates and proves a universal topological constraint on neuron configurations: for small steps, updates induce a bi-Lipschitz (hence topology-preserving) mapping between neurons, while large steps permit topological simplifications that reduce expressivity. The theory identifies a topological critical point $\eta^* = 1/K$ that separates a topology-preserving phase from a topology-changing phase, and connects these phases to the edge-of-stability phenomenon observed in practice. The results do not depend on specific architectures or losses: below the critical rate, permutation symmetry constrains the topology of neuron distributions and densities through measure-preserving updates. The paper further demonstrates applicability to Gradient Descent and Adam, provides experimental illustrations with Betti numbers on low-dimensional neuron manifolds and real tasks, and discusses implications for learning-rate schedules and deep-learning theory beyond the traditional NTK and mean-field pictures.

Abstract

We prove that for a broad class of permutation-equivariant learning rules (including SGD, Adam, and others), the training process induces a bi-Lipschitz mapping between neurons and strongly constrains the topology of the neuron distribution during training. This result reveals a qualitative difference between small and large learning rates $\eta$. With a learning rate below a topological critical point $\eta^*$, the training is constrained to preserve all topological structure of the neurons. In contrast, above $\eta^*$, the learning process allows for topological simplification, making the neuron manifold progressively coarser and thereby reducing the model's expressivity. Viewed in combination with the recent discovery of the edge of stability phenomenon, the learning dynamics of neural networks under gradient descent can be divided into two phases: first they undergo smooth optimization under topological constraints, and then they enter a second phase where they learn through drastic topological simplifications. A key feature of our theory is that it is independent of specific architectures or loss functions, enabling the universal application of topological methods to the study of deep learning.
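
The role of a threshold of the form $1/K$ can be seen in a simplified setting. The following is a sketch under an assumption that is not stated in this summary: a single update has the form $T(x) = x - \eta\, g(x)$, where $g$ is $K$-Lipschitz (the paper's own definition of $K$ may differ). Then for any two neurons $x, y$:

$$
(1 - \eta K)\,\|x - y\| \;\le\; \|T(x) - T(y)\| \;\le\; (1 + \eta K)\,\|x - y\|.
$$

Both bounds follow from the triangle inequality applied to $T(x) - T(y) = (x - y) - \eta\,(g(x) - g(y))$. When $\eta < 1/K$ the lower constant is positive, so $T$ is bi-Lipschitz and in particular injective: distinct neurons cannot merge. When $\eta \ge 1/K$ the lower bound becomes vacuous, and merging is no longer ruled out by this argument.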

Paper Structure

This paper contains 42 sections, 10 theorems, 38 equations, 11 figures, 1 table.

Key Result

Lemma 1

The following statement holds when $U^{(t)}$ satisfies P1. For any $i,j \in I$ such that $i \neq j$, if at time $t$ we have ${\boldsymbol{x}}_i^{(t)} = {\boldsymbol{x}}_j^{(t)}$, then, ${\boldsymbol{x}}_i^{(t + 1)} = {\boldsymbol{x}}_j^{(t + 1)}$.
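
A small numerical check can make the lemma concrete. The sketch below assumes that P1 refers to the permutation-equivariance property described in the abstract, and uses a hypothetical two-layer tanh network trained with full-batch gradient descent on synthetic data; the function `gd_step`, the data, and all hyperparameters are illustrative choices, not taken from the paper.

```python
import numpy as np

def gd_step(theta, X, y, lr):
    """One full-batch GD step on the loss (1/2n) * sum_k (f(x_k) - y_k)^2.

    theta has shape (n_neurons, d + 1): row i holds the input weights w_i
    followed by the output weight a_i, and f(x) = sum_i a_i * tanh(w_i . x).
    """
    W, a = theta[:, :-1], theta[:, -1]
    H = np.tanh(X @ W.T)                       # (n_samples, n_neurons)
    resid = H @ a - y                          # (n_samples,)
    n = len(y)
    grad_a = H.T @ resid / n                   # (n_neurons,)
    grad_W = (resid[:, None] * (1.0 - H**2) * a).T @ X / n   # (n_neurons, d)
    grad = np.concatenate([grad_W, grad_a[:, None]], axis=1)
    return theta - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 2))                   # synthetic inputs (assumption)
y = np.sin(X[:, 0])                            # synthetic targets (assumption)

theta = rng.normal(size=(6, 3))
theta[1] = theta[0]                            # neurons 0 and 1 start identical

theta_final = theta.copy()
for _ in range(20):
    theta_final = gd_step(theta_final, X, y, lr=0.1)

# Neurons that coincide at time t should still coincide after training.
coincide = np.allclose(theta_final[0], theta_final[1])

# Equivariance: permuting the neurons and then updating should match
# updating and then permuting.
perm = rng.permutation(6)
equivariant = np.allclose(
    gd_step(theta[perm], X, y, 0.1), gd_step(theta, X, y, 0.1)[perm]
)
print(coincide, equivariant)
```

The comparison uses `np.allclose` rather than exact equality because floating-point matrix products are not guaranteed to round identically for different rows.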

Figures (11)

  • Figure 1: At a small learning rate, common learning algorithms induce a homeomorphic transformation of the neuron distribution (blue shapes in the figure), a mechanism underlying common theories including the NTK / lazy regime (jacot2018neural; chizat2018lazy) and the mean-field / feature-learning regime (yang2020feature). In contrast, most neural networks in real training scenarios are known to move towards the "edge of stability," where the discrete-time updates are no longer stable at any first-order stationary point. From the perspective of topology, what separates these two regimes is topological invariance in the first regime, where the learning process is strongly constrained to preserve all topological properties, and topological breakdown in the second regime, where learning ceases to preserve topology and acts as a simplifier that merges neurons and makes the model increasingly constrained in capacity.
  • Figure 2: An optimization perspective of the topological critical point. The topological critical point $\eta^* = 1/K$ corresponds to the step size that reduces the loss optimally, while the critical step size found by cohen2021gradient corresponds to the largest one ensuring loss decay.
  • Figure 3: Topology of a 2D neural network with GD. The neurons are initialized on a genus-2 surface and optimized with GD. We visualize the topology of 2D and 3D networks before and after training under different step sizes $\eta$. For small step sizes, the training may deform the geometric arrangement of the neurons but the topology remains unchanged. In contrast, for large step sizes, the topological structure can change substantially. These results consistently verify our theoretical predictions that while the geometry of the neurons can be affected by training, the underlying topology is stable under small learning rates but fragile under large ones.
  • Figure 5: Evolution of Betti numbers during training with GD. The main panel shows results for the large learning rate, while the inset shows results for the small one. Each curve is obtained by averaging over 10 runs; the shaded regions indicate the standard deviations.
  • Figure 6: Topology of a 2D neural network with GD and disjoint genus-1 initialization. The neurons are initialized on the disjoint union of two genus-1 surfaces and optimized with GD.
  • ...and 6 more figures
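
The optimization picture in the Figure 2 caption can be illustrated on a toy problem. The sketch below assumes a one-dimensional quadratic loss $f(x) = K x^2 / 2$, so that $K$ is simply its curvature; this is an illustrative stand-in, not the paper's setting, and the names `loss`, `gd_step`, and the chosen value of `K` are hypothetical. On this toy loss one GD step multiplies every point by $(1 - \eta K)$, so the loss is reduced most at $\eta = 1/K$ and stops decreasing at $\eta = 2/K$.

```python
import numpy as np

K = 4.0  # assumed curvature of the toy loss f(x) = K * x**2 / 2

def loss(x):
    return 0.5 * K * x**2

def gd_step(x, lr):
    # The gradient of the toy loss is K * x, so the update is x -> (1 - lr * K) * x.
    return x - (lr * K) * x

x0 = np.linspace(-1.0, 1.0, 5)                 # five distinct starting points
lrs = np.array([0.5, 1.0, 1.5, 2.0, 2.5]) / K  # step sizes in units of 1/K

# Ratio of the total loss after one step to the total loss before it.
ratios = np.array([loss(gd_step(x0, lr)).sum() / loss(x0).sum() for lr in lrs])
# Number of distinct points remaining after one step.
n_distinct = np.array([np.unique(gd_step(x0, lr)).size for lr in lrs])

best_lr = lrs[np.argmin(ratios)]
for lr, r, n in zip(lrs, ratios, n_distinct):
    print(f"lr*K = {lr * K:.1f}  loss ratio = {r:.2f}  distinct points = {n}")
```

In this linear toy the points merge only at exactly $\eta = 1/K$, where the whole line collapses to the minimizer; for nonlinear losses, step sizes above $1/K$ can fold the update map and merge points more generally.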

Theorems & Definitions (12)

  • Lemma 1: Well-definedness
  • Lemma 2: No Merging or Splitting
  • Lemma 3
  • Theorem 1: Main
  • Theorem 2
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 4: No Splitting
  • proof
  • ...and 2 more