Symmetry Induces Structure and Constraint of Learning

Liu Ziyin

TL;DR

This work studies how loss-function symmetries shape learning in deep networks through a unified mirror-reflection framework. Its central result is that any $O$-mirror-symmetric loss gives rise to the constraint $O^T\theta=0$, and that SGD with sufficiently large weight decay or gradient noise drives training toward these symmetry-constrained solutions, producing structure such as sparsity, low-rankness, and homogeneous ensembling. The paper extends the theory with an L1-equivalence view and a differentiable constraint algorithm (DCS) for enforcing symmetry-induced constraints in practice. It validates the framework on rescaling, rotation, and permutation symmetries, with experiments on linear regression, matrix factorization, ResNet18 on CIFAR-10, and transformers. The findings offer a principled explanation for loss of plasticity and collapse phenomena in neural networks, and give practical guidance on adding or removing symmetries to tailor model capacity and representation structure.

Abstract

Due to common architecture designs, symmetries exist extensively in contemporary neural networks. In this work, we unveil the importance of the loss function symmetries in affecting, if not deciding, the learning behavior of machine learning models. We prove that every mirror-reflection symmetry, with reflection surface $O$, in the loss function leads to the emergence of a constraint on the model parameters $\theta$: $O^T\theta=0$. This constrained solution becomes satisfied when either the weight decay or gradient noise is large. Common instances of mirror symmetries in deep learning include rescaling, rotation, and permutation symmetry. As direct corollaries, we show that rescaling symmetry leads to sparsity, rotation symmetry leads to low rankness, and permutation symmetry leads to homogeneous ensembling. Then, we show that the theoretical framework can explain intriguing phenomena, such as the loss of plasticity and various collapse phenomena in neural networks, and suggest how symmetries can be used to design an elegant algorithm to enforce hard constraints in a differentiable way.
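The claim that the constraint becomes satisfied at large weight decay can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes the two-parameter loss $(uw-1)^2$ with an added penalty $\gamma(u^2+w^2)$, and uses plain deterministic gradient descent instead of SGD. The names `grad`, `train`, `gamma`, `lr`, and `start` are hypothetical. For this particular loss, the Hessian at the origin has eigenvalues $2\gamma \pm 2$, so the origin $u=w=0$ (where both $u=w$ and $u=-w$ hold) is a local minimum only when $\gamma > 1$.

```python
def grad(u, w, gamma):
    """Gradient of (u*w - 1)**2 + gamma * (u**2 + w**2)."""
    r = 2.0 * (u * w - 1.0)
    return (r * w + 2.0 * gamma * u, r * u + 2.0 * gamma * w)


def train(gamma, lr=0.05, steps=5000, start=(1.5, 0.5)):
    """Plain gradient descent (an assumption; the paper studies SGD)."""
    u, w = start
    for _ in range(steps):
        gu, gw = grad(u, w, gamma)
        u, w = u - lr * gu, w - lr * gw
    return u, w


# Weak weight decay: the signal wins and the solution stays away from zero.
small = train(gamma=0.25)
# Strong weight decay: both parameters collapse to the constrained point u = w = 0.
large = train(gamma=2.0)

print("gamma=0.25 ->", small)
print("gamma=2.00 ->", large)
```

With the weak penalty the run settles at a nonzero point on the line $u=w$; with the strong penalty both parameters go to zero, which is the sparse solution that the abstract associates with rescaling symmetry.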

Paper Structure

This paper contains 30 sections, 6 theorems, 56 equations, 9 figures.

Key Result

Theorem 1

Let $\ell_0(w)$ satisfy the $O$-mirror symmetry. Then,

Figures (9)

  • Figure 1: Illustration of a simple mirror symmetry when $w\in \mathbb{R}^2$. Here, the mirror surface is $O^T= ((1,-1), (0,0))/\sqrt{2}$. Points $A$ and $B$ have the same loss value when the loss contains the $O$ symmetry. The projection of $A$ and $B$ onto the mirror surface, $C$, has a strictly smaller norm and is thus preferred by weight decay. Furthermore, any gradient on the mirror must also point within the mirror, so gradient (or gradient noise) cannot take the parameter outside the mirror once entered.
  • Figure 2: When symmetries exist, the symmetric solutions have highly structured Hessians. Left: the symmetry mirror $O$ partitions $H$ into two blocks: one block parallel to surfaces in $OO^T$, and the other orthogonal to it. When an extra symmetry exists, these two blocks can be decomposed into additional subblocks. Mid-Right: the loss function around a symmetric solution has a universal geometry. Here, $s$ is the component of the parameters along a direction of the $O$-symmetry. The competition between the signal in the dataset and the regularization strength determines the local landscape.
  • Figure 3: When loss function symmetries are present, the model converges to structurally constrained solutions at a high weight decay or gradient noise. Left: A vanilla linear regression trained with SGD does not converge to sparse solutions for any learning rate. When we introduce redundant rescaling symmetry to every parameter, sparser solutions are favored at higher learning rates ($\lambda$). Mid: Vanilla $200$-dimensional matrix factorization trained with SGD prefers lower-rank solutions when the gradient noise is strong due to the rotation symmetry. The inset shows that the model always stays full-rank if we remove the rotation symmetry by introducing residual connections. Right: Correlation of the pre-activation value of neurons in the penultimate layer of ResNet18. After training, the neurons are grouped into homogeneous blocks when weight decay is present. The inset shows that such block structures are rare when there is no weight decay. Also, the patterns are similar for post-activation values (Section \ref{app sec: exp concerns}), which further supports the claim that the block structures are due to the symmetry, not because of linearity. See Section \ref{app sec: exp concerns} for the experimental details.
  • Figure 4: Loss of plasticity in continual learning in a vanilla linear regressor (dashed) and linear regressors with rescaling symmetry (solid). Vanilla regression has no symmetry and does not suffer plasticity loss, whereas having symmetries leads to the loss of plasticity. One can fix the problem with one of the two suggested methods, either by removing the symmetry in the model or removing the absorbing states by injecting noise.
  • Figure 5: Stationary conditions in different loss landscapes. Left: $L= (wu - 1)^2$. Here, $u=w$ and $u=-w$ are the stationary conditions caused by the rescaling symmetry. Right: $\theta=(u,w)$ and $L=-||\theta||^2 + ||\theta||^4$. Here, the stationary condition caused by the rotation symmetry is every straight line crossing the origin. Every stationary condition delineates a submanifold of the entire landscape. Once the model is in this submanifold, SGD cannot leave it.
  • ...and 4 more figures
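
The geometric claims in the captions of Figures 1 and 5 can be checked numerically on the two example losses given there. The sketch below is illustrative only and makes several assumptions: the mirror is represented by a unit normal vector (`o_minus` for the line $u=w$, `o_plus` for the line $u=-w$), the reflection is taken to be $\theta \mapsto (I - 2OO^T)\theta$, and all function and variable names are hypothetical. It checks that a gradient evaluated on a mirror has no component along the mirror's normal, that the gradient of the rotation-symmetric loss is parallel to $\theta$, and that projecting a point onto the mirror reduces its norm.

```python
import math


def grad_rescale(u, w):
    """Gradient of L(u, w) = (w*u - 1)**2 (Figure 5, left)."""
    r = 2.0 * (w * u - 1.0)
    return (r * w, r * u)


def grad_rotation(u, w):
    """Gradient of L = -|theta|^2 + |theta|^4 (Figure 5, right)."""
    s = -2.0 + 4.0 * (u * u + w * w)
    return (s * u, s * w)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def project_onto_mirror(theta, o):
    """Remove the component of theta along the unit normal o."""
    c = dot(theta, o)
    return (theta[0] - c * o[0], theta[1] - c * o[1])


inv = 1.0 / math.sqrt(2.0)
o_minus = (inv, -inv)  # unit normal of the mirror u = w
o_plus = (inv, inv)    # unit normal of the mirror u = -w

# On each mirror, the gradient has no component along the mirror's normal,
# so a gradient step cannot move the parameters off the mirror.
normal_component_1 = dot(grad_rescale(0.7, 0.7), o_minus)
normal_component_2 = dot(grad_rescale(0.7, -0.7), o_plus)

# Rotation symmetry: the gradient is parallel to theta, so every straight
# line through the origin is preserved by gradient steps.
theta = (0.3, -1.1)
g = grad_rotation(*theta)
cross = theta[0] * g[1] - theta[1] * g[0]

# Figure 1: the projection C of a point A onto the mirror has a smaller norm.
a = (1.5, 0.5)
c = project_onto_mirror(a, o_minus)
norm_a, norm_c = math.hypot(*a), math.hypot(*c)

print(normal_component_1, normal_component_2, cross, norm_a, norm_c)
```

Here the point `a = (1.5, 0.5)` projects to `(1.0, 1.0)` on the line $u=w$, which is the sense in which weight decay prefers the point on the mirror.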

Theorems & Definitions (13)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Definition 3
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • proof
  • ...and 3 more