Proving Linear Mode Connectivity of Neural Networks via Optimal Transport

Damien Ferbach, Baptiste Goujaud, Gauthier Gidel, Aymeric Dieuleveut

TL;DR

The paper tackles the question of why diverse SGD-trained networks can be connected by a low-loss linear path, up to permutation of hidden units. It develops an optimal-transport framework rooted in mean-field theory to prove linear mode connectivity for multi-layer MLPs, deriving layer-wise width bounds and extending to Gaussian and sub-Gaussian weight distributions. A novel weight-matching method leveraging covariance information is proposed and shown to improve LMC in experiments on MNIST and CIFAR-10, with results tying the dimension of weight-distribution support to LMC effectiveness. The work provides a theoretically grounded mechanism linking weight distribution geometry, network width, and alignment, offering practical guidelines for achieving LMC through permutation-aware weight matching.

Abstract

The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have experimentally shown that two different solutions found after two runs of stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework theoretically explaining this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we express upper and lower bounds on the width of each layer of two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates, is correlated with linear mode connectivity.
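
The phrase "linearly connected modulo a permutation of the weights" can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes a two-layer ReLU network with mean-field (1/n) output scaling, treats each hidden neuron as the vector of its input and output weights, and uses a squared-Euclidean assignment cost solved with SciPy's `linear_sum_assignment`. The function names (`match_hidden_units`, `forward`, `interpolate`) and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_hidden_units(W_a, a_a, W_b, a_b):
    """Find the permutation of B's hidden units that best aligns them with A's.

    W_*: (n_hidden, d_in) input weights; a_*: (n_hidden,) output weights.
    Returns `cols`, where cols[i] is the neuron of B matched to neuron i of A,
    and the mean squared matching cost.
    """
    # Represent each neuron by its concatenated (input, output) weights.
    theta_a = np.concatenate([W_a, a_a[:, None]], axis=1)
    theta_b = np.concatenate([W_b, a_b[:, None]], axis=1)
    # cost[i, j] = squared distance between neuron i of A and neuron j of B.
    cost = ((theta_a[:, None, :] - theta_b[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cols, cost[rows, cols].mean()


def forward(W, a, x):
    """Two-layer ReLU network with assumed mean-field scaling 1/n_hidden."""
    return np.maximum(x @ W.T, 0.0) @ a / W.shape[0]


def interpolate(W_a, a_a, W_b, a_b, t):
    """Point at position t in [0, 1] on the linear path from A to B."""
    return (1 - t) * W_a + t * W_b, (1 - t) * a_a + t * a_b


# Toy check: B is A with shuffled hidden units, so after alignment the two
# networks coincide and every point on the linear path computes the same function.
rng = np.random.default_rng(0)
n_hidden, d_in = 16, 5
W_a = rng.normal(size=(n_hidden, d_in))
a_a = rng.normal(size=n_hidden)
shuffle = rng.permutation(n_hidden)
W_b, a_b = W_a[shuffle], a_a[shuffle]

cols, matching_cost = match_hidden_units(W_a, a_a, W_b, a_b)
W_b_aligned, a_b_aligned = W_b[cols], a_b[cols]

x = rng.normal(size=(8, d_in))
W_mid, a_mid = interpolate(W_a, a_a, W_b_aligned, a_b_aligned, 0.5)
gap_aligned = np.abs(forward(W_mid, a_mid, x) - forward(W_a, a_a, x)).max()
```

In this toy case the matching cost and `gap_aligned` are zero up to floating-point error, because B is an exact reshuffling of A. For two independently trained networks the matching cost is positive, and the abstract's argument is that it shrinks as the width grows, at a rate governed by Wasserstein convergence of empirical measures.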

Paper Structure (57 sections, 36 theorems, 164 equations, 3 figures)

Key Result

Theorem 3.1

Consider two two-layer neural networks as in (network_form), trained with equation (eq:sgd) with the same initialization over the weights independently and for the same underlying time $T$. Suppose assumptions (ass:noiseless_SGD) and (ass:noisy_SGD) hold. Then $\forall \delta, {\text{err}}, \exists N_{min}$ such that if …

Figures (3)

  • Figure 1: Permuting the neurons in the hidden layer of network $B$ to align them on network $A$
  • Figure 2: Statistics of the average network $M$ over the linear path between networks $A$ and $B$ using respectively weight matching (blue), weight matching using covariance of activations and activations (green), and activation matching (orange)
  • Figure 3: Statistics of the average network $M$ over the linear path between networks $A$ and $B$ using respectively weight matching (blue) and activation matching (orange)

Theorems & Definitions (66)

  • Theorem 3.1
  • Corollary 3.1
  • Lemma 4.0
  • Lemma 5.0
  • Theorem 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Definition B.1: Wasserstein distance (Villani, 2009)
  • Lemma B.2: Convexity of the optimal cost (Theorem 4.8 in Villani, 2009)
  • proof
  • ...and 56 more
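
For reference, the Wasserstein distance named in Definition B.1 is standardly defined as follows. This is the textbook form for probability measures $\mu, \nu$ on $\mathbb{R}^d$ with finite $p$-th moments; the exponent $p$ and the exact statement used in the paper are not shown in this listing, so the notation here is an assumption.

$$
W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^p \, d\pi(x, y) \right)^{1/p},
$$

where $\Pi(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$. When both measures are empirical with $n$ equally weighted atoms, $\mu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ and $\nu = \frac{1}{n}\sum_{i=1}^n \delta_{y_i}$, the infimum is attained at a permutation:

$$
W_p^p(\mu, \nu) = \min_{\sigma \in S_n} \frac{1}{n} \sum_{i=1}^n \|x_i - y_{\sigma(i)}\|^p .
$$

This identity is what links optimal transport to permuting hidden units: aligning the neurons of two networks of equal width amounts to solving an optimal-transport problem between their empirical weight distributions.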