Entropic Confinement and Mode Connectivity in Overparameterized Neural Networks

Luca Di Carlo, Chase Goddard, David J. Schwab

TL;DR

The study tackles a puzzle: minima are joined by low-loss paths, as if they sat in one connected valley, yet SGD stays localized near a single minimum. It introduces curvature-induced entropic forces, arising from SGD noise, that bias dynamics toward flatter regions, formalized via an effective potential $V_{eff}(y) = T \ln g(y)$. Using AutoNEB-generated minimum-energy paths between CIFAR-10 minima and a suite of curvature diagnostics, the authors show that curvature rises systematically away from the endpoints, creating entropic barriers that persist longer than energetic barriers and drive late-stage localization. These findings refine the valley metaphor into a curvature-modulated landscape and have implications for linear mode connectivity, generalization, and ensembling techniques such as SWA.

Abstract

Modern neural networks exhibit a striking property: basins of attraction in the loss landscape are often connected by low-loss paths, yet optimization dynamics generally remain confined to a single convex basin and rarely explore intermediate points. We resolve this paradox by identifying entropic barriers arising from the interplay between curvature variations along these paths and noise in optimization dynamics. Empirically, we find that curvature systematically rises away from minima, producing effective forces that bias noisy dynamics back toward the endpoints, even when the loss remains nearly flat. These barriers persist longer than energetic barriers, shaping the late-time localization of solutions in parameter space. Our results highlight the role of curvature-induced entropic forces in governing both connectivity and confinement in deep learning landscapes.
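
To make the entropic-barrier idea concrete, the following is a minimal sketch on a toy model, not the authors' setup. It assumes a two-dimensional valley $V(x, y) = \frac{1}{2} k(y) x^2$ whose loss is exactly zero along the path $x = 0$, with a hypothetical transverse stiffness $k(y)$ that rises away from the endpoints $y = 0$ and $y = 1$. Integrating the Boltzmann weight over the transverse direction $x$ gives a marginal over $y$, and its negative log (times $T$) is an effective potential. With a single transverse direction this equals $(T/2) \ln k(y)$ up to a constant; how that prefactor maps onto the $g(y)$ of the summary depends on the paper's definition of $g$, which is not given here.

```python
import numpy as np

def marginal_effective_potential(k, T, x_max=10.0, n_x=20001):
    """Effective potential along the path for the toy valley V = 0.5 * k(y) * x**2.

    k is an array of transverse stiffness values, one per path position y.
    The transverse coordinate x is integrated out numerically (Riemann sum).
    """
    x = np.linspace(-x_max, x_max, n_x)
    dx = x[1] - x[0]
    # Boltzmann weight exp(-V / T), shape (len(k), n_x)
    weights = np.exp(-0.5 * k[:, None] * x[None, :] ** 2 / T)
    marginal = weights.sum(axis=1) * dx
    return -T * np.log(marginal)

T = 0.1                                    # temperature (illustrative value)
y = np.linspace(0.0, 1.0, 101)             # position along the path
k = 1.0 + 8.0 * np.sin(np.pi * y) ** 2     # hypothetical stiffness, sharpest mid-path

v_eff = marginal_effective_potential(k, T)
barrier = v_eff - v_eff[0]                 # barrier height relative to the endpoint
predicted = 0.5 * T * np.log(k / k[0])     # closed form from the Gaussian integral

print("barrier peak at y =", y[np.argmax(barrier)])
print("max deviation from closed form:", np.max(np.abs(barrier - predicted)))
```

In this toy the loss along the path is flat, yet the effective potential has a barrier in the interior that grows linearly with $T$, which is the qualitative behavior the abstract describes.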

Paper Structure

This paper contains 24 sections, 9 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: Curvature produces an entropic force. (A) Illustration of a potential $V(r, \theta)$ with a circular minimum at $r=1$, where the curvature varies with angle. At zero temperature ($T=0$), the angular distribution is uniform, $P(\theta) = 1/(2\pi)$. At finite temperature, thermal fluctuations bias the system toward flatter regions (yellow) rather than sharper ones (red). (B) Example of a Brownian particle diffusing along the ridge of a loss landscape; lighter colors correspond to later times. Entropic forces generated by fluctuations push the particle toward flatter directions, effectively favoring broader regions of the landscape.
  • Figure 2: Entropy induces barriers between minima. (A,B) Curvature along minimum energy paths (MEPs) connecting different minima, measured via the trace of the Hessian (A) and the maximum eigenvalue of the Hessian (B). Numbers indicate distinct minima found via independent training runs, markers indicate pivot points; different colors correspond to different pairs of minima, and marker shapes denote MEPs found via different instantiations of the AutoNEB random seed. (C) Cross entropy loss along MEPs connecting different pairs of regular minima. (D) Spectrum of the Hessian along MEP 1–2, estimated via singular value decomposition (SVD) of the score matrix computed on $E = 1024$ training examples. As we move into the interior of the MEP, the entire spectrum shifts upward, reflecting an increase in the curvature in all directions along the path.
  • Figure 3: Relaxation dynamics induced by entropic forces. (A) Relaxation dynamics along the MEP for vanilla projected SGD (batch size $B=16$, learning rate $\eta = 0.02$) for models initialized at different points along the MEP (we use MEP 1–2 from Figure 2). The legend shows the two closest pivots to each starting point. Models initialized deeper along the MEP take longer to relax to the endpoint. (B, C) Models are initialized between the second and third pivots of the MEP and trained using projected SGD constrained to the path (see the section on projected SGD). The $y$-axis shows the relative Euclidean distance along the MEP, where 0 and 1 correspond to the endpoints of the path. The entropic force drives the models back toward the endpoints. (B) Models trained with learning rate $\eta = 0.02$ for increasing batch sizes. Relaxation to the endpoint is faster for smaller minibatches, indicating that entropic forces are stronger for smaller batch sizes. (C) Models trained with minibatch size 16 for increasing learning rates. Relaxation to the endpoint is faster for larger learning rates, indicating that entropic forces are stronger at higher effective temperatures. Different curves of the same color correspond to different realizations of the SGD noise.
  • Figure 4: Relaxation dynamics induced by entropic forces for different optimizers. Relaxation dynamics along the MEP for projected dynamics using Adam (green) and SGD with Nesterov momentum (red), compared to vanilla SGD (black). We plot the results against the number of updates (A) and the effective time (B). The effect of the entropic forces seems to be more prominent for both Adam and SGD with Nesterov momentum.
  • Figure 5: Entropic barriers are relevant later in training. (A) Schematic of linear mode connectivity [frankle-LMC]. We train a network to epoch $k$, then produce two new networks via different data ordering, and measure the loss along a linear path between them. (B) The average loss along such a path decreases rapidly as $k$ increases. (C) Top: The loss profile along linear paths for various $k$. Bottom: The curvature profile, measured by the maximum Hessian eigenvalue, for various $k$. (D) We plot the instability (the relative change along the path) of the loss and the curvature. For small $k$, the loss exhibits larger instability, while for larger $k$, the curvature exhibits larger instability.
  • ...and 4 more figures
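
The trends described in the Figure 3 caption (slower relaxation for deeper starting points, faster relaxation for smaller batches and larger learning rates) can be illustrated with a one-dimensional toy, sketched below under several assumptions that are not taken from the source. The stiffness profile `stiffness(y)` is hypothetical, the effective potential is taken to be $(T/2) \ln k(y)$ as for a single transverse direction, the effective temperature is set to $T = \eta / B$ (a common heuristic for SGD noise), and only the mean entropic drift is integrated, with the noise itself omitted.

```python
import math

def stiffness(y):
    """Hypothetical transverse curvature along the path, sharpest at y = 0.5."""
    return 1.0 + 8.0 * math.sin(math.pi * y) ** 2

def entropic_drift(y, T):
    """Mean drift -dV_eff/dy for the assumed V_eff(y) = (T / 2) * ln k(y)."""
    dk_dy = 8.0 * math.pi * math.sin(2.0 * math.pi * y)
    return -0.5 * T * dk_dy / stiffness(y)

def steps_to_endpoint(y0, lr, batch_size, threshold=0.05, max_steps=100_000):
    """Count updates until the path coordinate y falls below `threshold`.

    Assumes T = lr / batch_size and one Euler step of unit size per update.
    Returns max_steps if the threshold is never reached.
    """
    T = lr / batch_size
    y = y0
    for step in range(max_steps):
        if y < threshold:
            return step
        y += entropic_drift(y, T)
    return max_steps

base = steps_to_endpoint(0.3, lr=0.02, batch_size=16)
print("baseline (B=16, lr=0.02):", base)
print("larger batch (B=64):     ", steps_to_endpoint(0.3, lr=0.02, batch_size=64))
print("larger lr (0.04):        ", steps_to_endpoint(0.3, lr=0.04, batch_size=16))
print("deeper start (y0=0.45):  ", steps_to_endpoint(0.45, lr=0.02, batch_size=16))
```

Because the drift is proportional to $T$ in this toy, relaxation time scales inversely with $\eta / B$; the real experiments use projected SGD on a network, so only the qualitative ordering should be compared.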