The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof

Derek Lim; Theo Moe Putterman; Robin Walters; Haggai Maron; Stefanie Jegelka

TL;DR

This work empirically investigates the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. It develops two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries.

Abstract

Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries -- transformations of neural network parameters that do not change the underlying neural network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization or loss landscapes. However, theoretical analysis of the relationship between parameter space symmetries and these phenomena is difficult. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. We develop two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries. With these new methods, we conduct a comprehensive experimental study consisting of multiple tasks aimed at assessing the effect of removing parameter symmetries. Our experiments reveal several interesting observations on the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity between our networks without alignment of weight spaces, and we find that our networks allow for faster and more effective Bayesian neural network training. Our code is available at https://github.com/cptq/asymmetric-networks
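
To make the notion of a parameter symmetry concrete, the following is a minimal NumPy sketch of the permutation symmetry of a standard two-layer ReLU MLP: reordering the hidden units changes the parameters but not the function. The helper names (`mlp`, `permute_hidden`) and the layer sizes are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP: x -> W2 relu(W1 x + b1) + b2."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1 and b1, and matching columns of W2."""
    return W1[perm], b1[perm], W2[:, perm]

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 6, 3  # illustrative sizes
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)

# A fixed non-identity permutation (cyclic shift of the hidden units).
perm = np.roll(np.arange(d_hidden), 1)
W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm)

x = rng.normal(size=d_in)
# The parameters differ, but the network function is unchanged.
same_output = np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2))
params_differ = not np.array_equal(W1, W1p)
```

Here `same_output` and `params_differ` both hold, which is exactly the situation the abstract describes: two distinct points in parameter space that represent the same function.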

Paper Structure (43 sections, 11 theorems, 43 equations, 6 figures, 14 tables)

Introduction
Background and Definitions
Related Work
Asymmetric Networks
Computation Graph Approach ($\mathbf{W}$-Asymmetric Networks)
Nonlinearity Approach ($\sigma$-Asymmetric Networks)
FiGLU: the Fixed Gated Linear Unit Nonlinearity
Extension to Other Architectures
Universal Approximation
Experiments
Linear Mode Connectivity without Permutation Alignment
Bayesian Neural Networks
Metanetworks
Monotonic Linear Interpolation
Other Optimization and Loss Landscape Properties
...and 28 more sections

Key Result

Theorem 1

If each mask matrix $M$ has unique nonzero rows, then $\mathbf{W}$-Asymmetric MLPs with fixed entries set to zero have no nontrivial neural DAG automorphisms.
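
The condition in Theorem 1 can be illustrated with a small NumPy sketch. It assumes a convention that is not stated in this summary: the mask $M$ has entry 1 where a weight is trainable and 0 where it is fixed, and the effective weight is `M * W_train + (1 - M) * F` with fixed values `F` (zero here, as in the theorem). "Unique nonzero rows" is read as: every row is nonzero and no two rows are equal. All function names are hypothetical.

```python
import numpy as np

def has_unique_nonzero_rows(M):
    """Check that every row of M is nonzero and all rows are pairwise distinct."""
    M = np.asarray(M)
    rows_nonzero = bool(np.all(M.any(axis=1)))
    rows_distinct = np.unique(M, axis=0).shape[0] == M.shape[0]
    return rows_nonzero and rows_distinct

def w_asym_weight(W_train, M, F=None):
    """Effective weight: trainable where M == 1, fixed to F where M == 0 (assumed convention)."""
    if F is None:
        F = np.zeros_like(W_train)  # fixed entries set to zero, as in Theorem 1
    return M * W_train + (1 - M) * F

def respects_mask(W, M):
    """True if W is zero at every fixed position (M == 0)."""
    return bool(np.all(W[M == 0] == 0))

# Each hidden unit (row) has a different sparsity pattern.
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])
W_train = np.arange(1, 10, dtype=float).reshape(3, 3)
W_eff = w_asym_weight(W_train, M)

# Swapping hidden units 0 and 1 moves nonzero weights onto fixed-zero
# positions, so the permuted weights are not valid for this architecture.
W_swapped = W_eff[[1, 0, 2]]
```

In this toy case `respects_mask(W_eff, M)` holds while `respects_mask(W_swapped, M)` does not, which gives some intuition for why distinct row patterns rule out hidden-unit permutations; it is not a substitute for the proof.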

Figures (6)

Figure 1: (Left) Standard MLP. The hidden nodes (grey hatches) can be freely permuted, which induces permutation parameter symmetries. Black edges denote trainable parameters. (Middle) Our $\mathbf{W}$-Asymmetric MLP, which fixes certain weights to be constant and untrainable (colored dashed lines) to break parameter symmetries. (Right) Our $\sigma$-Asymmetric MLP, which uses our FiGLU nonlinearity involving a fixed matrix $\mathbf{\textcolor{purple}{F}}$ (colored dashed lines) to break parameter symmetries.
Figure 2: Depiction of our $\mathbf{W}$-Asymmetric approach to removing parameter symmetries. Entries with a black outline are untrained. Note that the $\mathbf{W}$-Asym linear map has 2 nonzeros per row, the $\mathbf{W}$-Asym convolution with fixed entries has 8 fixed entries for its single output channel, and the $\mathbf{W}$-Asym convolution with fixed filters has a single input filter fixed. We often use a constant number of fixed entries per row or output channel in our experiments.
Figure 3: Linear mode connectivity: test loss curves along linear interpolations between trained networks. (Left) MLP on MNIST. (Middle) ResNet with $8\times$ width on CIFAR-10. (Right) GNN on ogbn-arXiv. $\mathbf{W}$-Asymmetric networks interpolate the best, followed by networks aligned with Git-Rebasin, then $\sigma$-Asymmetric networks, and finally standard networks.
Figure 4: Bayesian neural network training loss over time for depth 8 MLPs on MNIST (left), ResNet110 on CIFAR-10 (middle), and ResNet20 with BatchNorm on CIFAR-100 (right). $\mathbf{W}$-Asymmetric networks train more quickly, and achieve lower training loss.
Figure 5: Train loss against interpolation coefficient $\alpha$ for the interpolation $(1-\alpha) \theta_0 + \alpha \theta_T$ between initial parameters $\theta_0$ and trained parameters $\theta_T$. Trajectories for the 20 $(\theta_0, \theta_T)$ pairs of lowest train loss for each architecture are plotted. The trajectories for Asymmetric ResNets appear significantly more monotonic and convex.
...and 1 more figure
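
The interpolation experiments in Figures 3 and 5 evaluate a loss along the segment $(1-\alpha)\theta_A + \alpha\theta_B$ between two parameter vectors. The sketch below shows that computation on a toy quadratic loss, which stands in for a real network's loss and is purely an assumption for illustration. The `loss_barrier` function uses one common definition of a barrier (the largest excess of the path loss over the linear interpolation of the endpoint losses), which may differ from the metric used in the paper.

```python
import numpy as np

def interpolate_losses(theta_a, theta_b, loss_fn, num_points=11):
    """Loss at evenly spaced points on the segment from theta_a to theta_b."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    return alphas, losses

def is_monotone_nonincreasing(losses, tol=1e-12):
    """True if the loss never increases along the path (up to tol)."""
    return bool(np.all(np.diff(losses) <= tol))

def loss_barrier(alphas, losses):
    """Max excess of the path loss over the straight line between endpoint losses."""
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))

# Toy stand-in for a training loss: squared distance to a target vector.
target = np.array([1.0, -2.0, 0.5])

def toy_loss(theta):
    return float(np.sum((theta - target) ** 2))

theta_0 = np.zeros(3)    # "initial" parameters
theta_T = target.copy()  # "trained" parameters (the minimizer of toy_loss)

alphas, losses = interpolate_losses(theta_0, theta_T, toy_loss)
monotone = is_monotone_nonincreasing(losses)
barrier = loss_barrier(alphas, losses)
```

For a real network, `loss_fn` would load the interpolated parameters into the model and evaluate it on a dataset; for architectures with BatchNorm, normalization statistics typically need separate handling, which this sketch does not cover.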

Theorems & Definitions (18)

Theorem 1
Proposition 1
Proposition 2
Theorem 2: Informal
Theorem 3
proof
Lemma 1
proof
Proposition 3
proof
...and 8 more
