Flat Channels to Infinity in Neural Loss Landscapes
Flavio Martinelli, Alexander Van Meegen, Berfin Şimşek, Wulfram Gerstner, Johanni Brea
TL;DR
This paper uncovers channels to infinity in neural loss landscapes: directions along which gradient flow makes extremely slow progress while the readout weights of two neurons diverge and their input weights coalesce. These channels run parallel to symmetry-induced saddle lines created by neuron duplication. In the limit of diverging readouts and coinciding input weights, the neuron pair implements a gated linear unit, which gives these quasi-flat regions a functional interpretation. The authors develop a reparameterization and an epsilon-expansion analysis showing convergence to gated linear units, and illustrate stability properties through theory and toy experiments. They demonstrate that gradient-based optimizers routinely approach these channels from random initializations and that channels can host minima at infinity with distinct computational capabilities. These insights offer a new lens on non-convex optimization in deep networks and suggest practical implications for generalization, model fusion, and training dynamics in large-scale architectures, including potential extensions to multi-neuron channels and deeper networks.
Abstract
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w}_i$ and $\mathbf{w}_j$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\sigma(\mathbf{w}_j \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
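
The gated-linear-unit limit stated in the abstract can be checked numerically with a minimal sketch. The construction below is an illustrative assumption, not necessarily the authors' reparameterization: it sets $a_i = \tfrac{1}{2} + \tfrac{1}{2\epsilon}$, $a_j = \tfrac{1}{2} - \tfrac{1}{2\epsilon}$, $\mathbf{w}_i = \mathbf{w} + \epsilon\mathbf{v}$, and $\mathbf{w}_j = \mathbf{w} - \epsilon\mathbf{v}$, so that the readouts diverge and the input weights coalesce as $\epsilon \to 0$. The activation ($\tanh$), the helper names, and the example vectors are likewise hypothetical choices made for illustration.

```python
import math


def sigma(z):
    # Activation function; tanh is an assumed choice for this sketch.
    return math.tanh(z)


def sigma_prime(z):
    # Derivative of tanh.
    return 1.0 - math.tanh(z) ** 2


def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))


def two_neuron_output(w, v, x, eps):
    # Assumed parameterization: readouts diverge like +-1/(2*eps),
    # while both input weight vectors approach w as eps -> 0.
    a_i = 0.5 + 0.5 / eps
    a_j = 0.5 - 0.5 / eps
    w_i = [wk + eps * vk for wk, vk in zip(w, v)]
    w_j = [wk - eps * vk for wk, vk in zip(w, v)]
    return a_i * sigma(dot(w_i, x)) + a_j * sigma(dot(w_j, x))


def gated_linear_unit(w, v, x):
    # Limiting function: sigma(w.x) + (v.x) * sigma'(w.x)
    z = dot(w, x)
    return sigma(z) + dot(v, x) * sigma_prime(z)


# Arbitrary example vectors (hypothetical values).
w = [0.5, -1.0]
v = [1.0, 2.0]
x = [0.3, 0.7]

target = gated_linear_unit(w, v, x)
eps_values = (1e-1, 1e-2, 1e-3)
errors = [abs(two_neuron_output(w, v, x, eps) - target) for eps in eps_values]

for eps, err in zip(eps_values, errors):
    print(f"eps={eps:g}  readout a_i={0.5 + 0.5 / eps:g}  |error|={err:.3e}")
```

Under this symmetric construction, a Taylor expansion of the two-neuron sum around $\mathbf{w} \cdot \mathbf{x}$ leaves a remainder of order $\epsilon^2$, so the printed error should shrink as the readouts grow.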
