Parameter Symmetry and Noise Equilibrium of Stochastic Gradient Descent

Liu Ziyin; Mingze Wang; Hongchao Li; Lei Wu

TL;DR

The balance and alignment of gradient noise offer an alternative mechanism for explaining phenomena such as progressive sharpening/flattening and representation formation in neural networks, with practical implications for understanding techniques like representation normalization and warmup.

Abstract

Symmetries are prevalent in deep learning and can significantly influence the learning dynamics of neural networks. In this paper, we examine how exponential symmetries -- a broad subclass of continuous symmetries present in the model architecture or loss function -- interplay with stochastic gradient descent (SGD). We first prove that gradient noise creates a systematic motion (a "Noether flow") of the parameters $\theta$ along the degenerate direction to a unique initialization-independent fixed point $\theta^*$. These points are referred to as the noise equilibria because, at these points, noise contributions from different directions are balanced and aligned. Then, we show that the balance and alignment of gradient noise can serve as a novel alternative mechanism for explaining important phenomena such as progressive sharpening/flattening and representation formation within neural networks and have practical implications for understanding techniques like representation normalization and warmup.
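As a minimal illustration of how gradient noise can drive parameters along a symmetry direction, consider a toy problem that is not taken from the paper: a scalar two-layer model with per-sample loss $\ell(u, w) = (u w x - y)^2$, which is invariant under the rescaling $(u, w) \to (e^{\lambda} u, e^{-\lambda} w)$. For a single SGD step with learning rate $\eta$ and $r = 2x(uwx - y)$, direct expansion gives $u'^2 - w'^2 = (u^2 - w^2)(1 - \eta^2 r^2)$, so the imbalance $u^2 - w^2$ is conserved by gradient flow but contracts toward zero whenever the sampled residual is nonzero. The sketch below checks this numerically; the function name `sgd_imbalance` and all hyperparameters are hypothetical choices made for illustration only.

```python
import numpy as np

def sgd_imbalance(u, w, eta=0.01, steps=5000, noise_std=0.5, seed=0):
    """Run single-sample SGD on l(u, w) = (u*w*x - y)**2 and record u**2 - w**2.

    Toy setup (an assumption, not the paper's experiment): x ~ N(0, 1) and
    y = x + noise, so the residual does not vanish even at the minimum.
    """
    rng = np.random.default_rng(seed)
    history = [u**2 - w**2]
    for _ in range(steps):
        x = rng.normal()
        y = x + noise_std * rng.normal()      # noisy label, target slope 1
        r = 2.0 * x * (u * w * x - y)         # dl/du = r * w, dl/dw = r * u
        u, w = u - eta * r * w, w - eta * r * u  # simultaneous update
        history.append(u**2 - w**2)
    return np.array(history)

# Start from an unbalanced point with u * w = 1 (already fitting the mean slope).
imbalance = sgd_imbalance(u=2.0, w=0.5)
```

Comparing `imbalance[0]` with `imbalance[-1]` shows how far the imbalance has contracted; setting `noise_std=0` and starting at a point with $uw = 1$ makes the residual vanish and leaves the imbalance unchanged.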

Paper Structure (29 sections, 14 theorems, 124 equations, 9 figures)

Introduction
Related Works
Preliminaries
Continuous Symmetry and Noise Equilibria
Noether Flow in Degenerate Directions
Exponential symmetries
Noise Equilibrium and Fixed Point Theorem
Applications
Generalized Matrix Factorization
Balance and Stability of Matrix Factorization
Noise Driven Progressive Sharpening and Flattening
Flat or Sharp?
Noise-Aligned Solution of Deep Linear Networks
Approximate Symmetry and Bias of SGD
Conclusion
...and 14 more sections

Key Result

Theorem 4.3

Let the per-sample loss satisfy the $A$-exponential symmetry and $\theta_\lambda := \exp[\lambda A] \theta$. Then, for any $\theta$ and any $\gamma\geq 0$, ... (A similar result can be proved for discrete-time SGD; see the appendix section on discrete-time SGD.)
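To make the notation $\theta_\lambda := \exp[\lambda A]\theta$ concrete, the sketch below assumes that $A$-exponential symmetry means $\ell(\exp[\lambda A]\theta) = \ell(\theta)$ for all $\lambda$ with a symmetric matrix $A$; this definition is not reproduced in the text above, so treat it as an assumption. The toy loss $\ell(u, w) = (uwx - y)^2$, the choice $A = \mathrm{diag}(1, -1)$, and the helper name `exp_sym` are illustrative and not taken from the paper.

```python
import numpy as np

def exp_sym(A, lam):
    """exp(lam * A) for a symmetric matrix A, computed via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(A)
    return eigvecs @ np.diag(np.exp(lam * eigvals)) @ eigvecs.T

def loss(theta, x=1.3, y=0.7):
    """Toy per-sample loss of a scalar two-layer linear model."""
    u, w = theta
    return (u * w * x - y) ** 2

A = np.diag([1.0, -1.0])          # generator of the rescaling (u, w) -> (e^lam u, e^-lam w)
theta = np.array([0.8, -1.5])

# The loss should be constant along the orbit theta_lambda = exp(lam * A) @ theta.
losses = [loss(exp_sym(A, lam) @ theta) for lam in (-2.0, -0.5, 0.0, 1.0, 3.0)]
```

Here the product $uw$ is unchanged along the orbit, so every entry of `losses` agrees up to floating-point error, while the individual magnitudes of $u$ and $w$ vary freely. This free direction is the degenerate direction that the abstract refers to.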

Figures (9)

Figure 1: An example of a 2d loss function with scale invariance: $\ell(\theta) = \ell(\lambda \theta)$ for a scalar $\lambda$ and $\theta \in \mathbb{R}^2$. Because of the symmetry, the gradient $\nabla \ell$ must be tangential to the circles whose center is the origin. This implies that the norm $\|\theta\|$ does not change during gradient flow training. However, when the training is stochastic or discrete-time, SGD must move outward. If the model starts at $\theta_t$, it must move to a larger circle. As an illustrative example, this loss function has a unique and attractive fixed point: $\|\theta\|=\infty$. SGD will diverge after training under scale invariance. Also, see Remark \ref{remark1} for a discussion of the difference between discrete-time and continuous-time dynamics.
Figure 2: Comparison between GD and SGD for matrix factorizations. Left: Example of a learning trajectory. The convergence is approximately exponential in experiments. Middle: evolution of $10$ individual elements of $\Delta_{ij} := (U^{\top} \Gamma_U U - W\Gamma_W W^{\top})_{ij}$. As the theory shows, they all move close to zero and fluctuate with a small variance. Right: Converged solutions of SGD agree with the prediction of Theorem \ref{theo: fixed point of standard mf}, but are an order of magnitude away from the solution found by GD, even if they start from the same initialization.
Figure 3: A two-layer linear network after training. Here, the problem setting is the same as Figure \ref{fig:mf robustness}. The theoretical prediction is computed from Theorem \ref{theo: fixed point of standard mf}. Left: balance of the norm is only achieved when $\phi_x = 1$, namely, when the data has an isotropic covariance. We also test SGD with a small weight decay ($10^{-4}$), which is sufficiently small that the solution we obtained for SGD without weight decay still holds approximately. In contrast, training with GD + WD always converges to a norm-balanced solution. Right: the sharpness of the converged model trained with SGD. We see that for some data distributions, SGD converges to a sharper solution, whereas it converges to flatter solutions for other data distributions. Both the flattening and the sharpening effects are due to the noise-balance effect of SGD. Here, we find that the systematic error between experiment and theory is due to the use of a finite learning rate and decreases as we decrease $\eta$.
Figure 4: Dynamics of the stability condition $S$ during the training of a rank-1 matrix factorization problem. The solid lines show the training of SGD with Kaiming init. When the learning rate ($\eta=0.008$) is too large, SGD diverges (orange line). However, when one starts training at a small learning rate ($0.001$) and increases $\eta$ to $0.008$ after 5000 iterations, the training remains stable. This is because SGD training improves the stability condition during training, which is in agreement with the theory. In contrast, the stability condition of GD and that of SGD with a Xavier init increase only slightly. Also, note that both Xavier and Kaiming init under SGD converge to the same stability condition because the equilibrium is unique.
Figure 5: Norms of the weights of a multilayer deep linear network during training on MNIST without weight decay. We see that the intermediate layers converge to the same norm during training, whereas the input and output layers are different because they are determined by the input and output noise. This effect is robust against different initializations. This agrees with our analysis for deep linear nets (Theorem \ref{theo: deep-linear}). Left: initializing all layers with the same norm. Right: initializing all layers at randomly different norms.
...and 4 more figures
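The norm-growth argument in the caption of Figure 1 can be checked numerically. The sketch below uses a hypothetical scale-invariant loss, $\ell(\theta) = 1 - \langle\theta, v\rangle / \|\theta\|$, which satisfies $\ell(\lambda\theta) = \ell(\theta)$ only for $\lambda > 0$ and is not the loss plotted in the figure. Its gradient is orthogonal to $\theta$, so each discrete step gives $\|\theta - \eta g\|^2 = \|\theta\|^2 + \eta^2\|g\|^2$ and the norm cannot decrease. The function names, the noisy target `v`, and the hyperparameters are assumptions made for illustration.

```python
import numpy as np

def grad_cosine_loss(theta, v):
    """Gradient of l(theta) = 1 - <theta, v> / ||theta|| (invariant to positive rescaling)."""
    norm = np.linalg.norm(theta)
    return -(v / norm - (theta @ v) * theta / norm**3)

def norm_trajectory(theta0, eta=0.1, steps=200, seed=0):
    """Run SGD with a randomly perturbed target direction and record ||theta||."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    norms = [np.linalg.norm(theta)]
    for _ in range(steps):
        v = np.array([1.0, 0.0]) + 0.5 * rng.normal(size=2)  # noisy sample
        g = grad_cosine_loss(theta, v)    # orthogonal to theta by scale invariance
        theta = theta - eta * g           # discrete step moves to a larger circle
        norms.append(np.linalg.norm(theta))
    return np.array(norms)

norms = norm_trajectory([1.0, 1.0])
```

The recorded norms are non-decreasing up to floating-point error, and the per-step growth $\eta^2\|g\|^2$ vanishes as $\eta \to 0$, which is consistent with the caption's statement that the norm is constant under gradient flow.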

Theorems & Definitions (30)

Definition 4.1
Definition 4.2
Theorem 4.3
Remark 4.4
Theorem 4.5
Definition 4.6
Proposition 5.1
Theorem 5.2
Proposition 5.3
Theorem 5.4
...and 20 more
