Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel L. K. Yamins, Hidenori Tanaka

TL;DR

The paper addresses the challenge of understanding neural network learning dynamics under finite stochastic gradient updates by introducing a symmetry-based framework that ties architectural invariances to geometric constraints on gradients and Hessians, yielding Noether-like conservation laws under gradient flow. It then develops a realistic continuous SGD model that incorporates weight decay, momentum, stochasticity, and finite learning rates, deriving exact dynamics for symmetry-related parameter combinations and validating them on VGG-16 trained on Tiny ImageNet. Key contributions include unifying gradient/Hessian geometry via symmetries, identifying conservation laws under gradient flow, and deriving exact finite-rate learning dynamics through a modified loss and flow, with strong empirical support. The work provides a principled foundation for analyzing and predicting training dynamics in state-of-the-art networks, potentially guiding optimizer design and architectural choices at realistic scales.

Abstract

Understanding the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such symmetry imposes stringent geometric constraints on gradients and Hessians, leading to an associated conservation law in the continuous-time limit of stochastic gradient descent (SGD), akin to Noether's theorem in physics. We further show that finite learning rates used in practice can actually break these symmetry-induced conservation laws. We apply tools from finite difference methods to derive modified gradient flow, a differential equation that better approximates the numerical trajectory taken by SGD at finite learning rates. We combine modified gradient flow with our framework of symmetries to derive exact integral expressions for the dynamics of certain parameter combinations. We empirically validate our analytic expressions for learning dynamics on VGG-16 trained on Tiny ImageNet. Overall, by exploiting symmetry, our work demonstrates that we can analytically describe the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state-of-the-art architectures trained on any dataset.

Paper Structure

This paper contains 27 sections, 2 theorems, 79 equations, 17 figures, 5 tables.

Key Result

Theorem 1

Symmetry and conservation laws in neural networks. Every differentiable symmetry $\psi(\alpha, \theta)$ of the loss that satisfies $\langle \theta, [\partial_{\alpha} \partial_{\theta} \psi\vert_{\alpha=I}] g(\theta) \rangle = 0$ has the corresponding conservation law $\frac{d}{dt}\langle \theta, \partial_{\alpha} \psi\vert_{\alpha=I} \rangle = 0$ through learning under gradient flow.
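As a rough illustration of the theorem for scale symmetry, the sketch below uses a hypothetical toy loss that depends on $w$ only through $w / \lVert w \rVert$. The loss, the function names, and the step sizes are assumptions for illustration and are not taken from the paper. It checks that the gradient is orthogonal to $w$, so $\lVert w \rVert^2$ would be conserved under gradient flow, and that discrete gradient descent breaks this conservation by an amount that shrinks with the learning rate.

```python
import numpy as np

def grad(w, target):
    """Gradient of the toy loss 0.5 * ||w/||w|| - target||^2 with respect to w."""
    r = np.linalg.norm(w)
    u = w / r
    residual = u - target
    # The Jacobian of w -> w/||w|| is (I - u u^T) / r, a projection orthogonal to w.
    return (residual - u * np.dot(u, residual)) / r

def squared_norm_drift(w0, target, lr, steps):
    """Run plain gradient descent and return ||w_T||^2 - ||w_0||^2."""
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * grad(w, target)
    return np.dot(w, w) - np.dot(w0, w0)

rng = np.random.default_rng(0)
w0 = rng.normal(size=5)
target = rng.normal(size=5)
target /= np.linalg.norm(target)

# Geometric constraint from scale symmetry: <w, g(w)> = 0.
orthogonality = np.dot(w0, grad(w0, target))

# Same total training time (lr * steps = 1) at two different step sizes.
drift_coarse = squared_norm_drift(w0, target, lr=0.1, steps=10)
drift_fine = squared_norm_drift(w0, target, lr=0.001, steps=1000)

print(f"<w, g> = {orthogonality:.2e}")
print(f"drift in ||w||^2 at lr=0.1:   {drift_coarse:.3e}")
print(f"drift in ||w||^2 at lr=0.001: {drift_fine:.3e}")
```

Because $\langle w, g \rangle = 0$, each discrete step adds exactly $\eta^2 \lVert g \rVert^2$ to the squared norm, which is why the drift is positive and vanishes only as the learning rate goes to zero.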

Figures (17)

  • Figure 1: Neuron level dynamics are simpler than parameter dynamics. We plot the per-parameter dynamics (left) and per-channel squared Euclidean norm dynamics (right) for the convolutional layers of a VGG-16 model (with batch normalization) trained on Tiny ImageNet with SGD with learning rate $\eta = 0.1$, weight decay $\lambda = 10^{-4}$, and batch size $S = 256$. While the parameter dynamics are noisy and chaotic, the neuron dynamics are smooth and patterned.
  • Figure 2: Visualizing symmetry. We visualize the vector fields associated with simple network components that have translation, scale, and rescale symmetry. In (a) we consider the vector field associated with a neuron $\sigma\left(\begin{bmatrix} w_1 & w_2 \end{bmatrix}^\intercal x\right)$ where $\sigma$ is the softmax function. In (b) we consider the vector field associated with a neuron $\text{BN}\left(\begin{bmatrix} w_1 & w_2 \end{bmatrix}\begin{bmatrix} x_1 & x_2 \end{bmatrix}^\intercal\right)$ where $\text{BN}$ is the batch normalization function. In (c) we consider the vector field associated with a linear path $w_2w_1 x$.
  • Figure 3: Visualizing conservation. Associated with each symmetry is a conserved quantity constraining the gradient flow dynamics to a surface. For translation symmetry (a) the flow is constrained to a hyperplane where the intercept is conserved. For scale symmetry (b) the flow is constrained to a sphere where the radius is conserved. For rescale symmetry (c) the flow is constrained to a hyperbola where the axes are conserved. The color represents the value of the conserved quantity, where blue is positive and red is negative, and the black lines are level sets.
  • Figure 4: Modeling discretization. We visualize the trajectories of gradient descent and momentum (black dots), gradient flow with and without momentum (blue lines), and the modified dynamics (red lines) on the quadratic loss $\mathcal{L}(w) = w^\intercal\begin{bmatrix} 2.5 & -1.5 \\ -1.5 & 2 \end{bmatrix}w$. On the left we visualize gradient dynamics using modified loss. On the right we visualize momentum dynamics using modified flow. In both settings the modified continuous dynamics visually track the discrete dynamics better than the original continuous dynamics. See the appendix on modified equation analysis for further details.
  • Figure 5: Exact dynamics of VGG-16 on Tiny ImageNet. We plot the column sum of the final linear layer (top row) and the difference between squared channel norms of the fifth and fourth convolutional layer (bottom row) of a VGG-16 model without batch normalization. We plot the squared channel norm of the second convolution layer (middle row) of a VGG-16 model with batch normalization. Both models are trained on Tiny ImageNet with SGD with learning rate $\eta = 0.1$, weight decay $\lambda$, batch size $S = 256$, for $100$ epochs. Colored lines are empirical and black dashed lines are the theoretical predictions from the translation, scale, and rescale SGD equations. See the experiments appendix for more details.
  • ...and 12 more figures
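
The gradient descent panel of Figure 4 can be approximated numerically with the minimal sketch below. It rests on two assumptions that this summary does not state explicitly: the quadratic form is read as the symmetric matrix with rows $(2.5, -1.5)$ and $(-1.5, 2)$, and the modified loss is taken to be the standard first-order form $\tilde{\mathcal{L}} = \mathcal{L} + \frac{\eta}{4}\lVert \nabla \mathcal{L} \rVert^2$. The initial point, step count, and function names are hypothetical, and momentum is not covered.

```python
import numpy as np

A = np.array([[2.5, -1.5], [-1.5, 2.0]])  # assumed reading of the Figure 4 matrix
eta = 0.1                                  # learning rate (illustrative choice)
steps = 20
w0 = np.array([1.0, 1.0])                  # hypothetical initial point

def gradient_descent(A, w0, eta, steps):
    """Discrete updates w <- w - eta * grad L(w), with L(w) = w^T A w and grad L = 2 A w."""
    w = w0.copy()
    trajectory = [w.copy()]
    for _ in range(steps):
        w = w - eta * 2.0 * A @ w
        trajectory.append(w.copy())
    return np.array(trajectory)

def linear_flow(M, w0, times):
    """Exact solution of dw/dt = -2 M w for a symmetric matrix M, via eigendecomposition."""
    evals, evecs = np.linalg.eigh(M)
    coeffs = evecs.T @ w0
    return np.array([evecs @ (np.exp(-2.0 * evals * t) * coeffs) for t in times])

times = eta * np.arange(steps + 1)
discrete = gradient_descent(A, w0, eta, steps)

# Gradient flow on the original loss.
flow = linear_flow(A, w0, times)

# Gradient flow on the assumed modified loss:
# L + (eta/4) * ||2 A w||^2 = w^T (A + eta * A @ A) w.
modified = linear_flow(A + eta * A @ A, w0, times)

err_flow = np.max(np.linalg.norm(flow - discrete, axis=1))
err_modified = np.max(np.linalg.norm(modified - discrete, axis=1))
print(f"max distance, gradient flow vs. gradient descent: {err_flow:.4f}")
print(f"max distance, modified flow vs. gradient descent: {err_modified:.4f}")
```

On a quadratic loss both flows are linear, so they can be solved in closed form, which keeps the comparison free of ODE solver error.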

Theorems & Definitions (3)

  • Theorem 1
  • Theorem
  • proof