Flat Channels to Infinity in Neural Loss Landscapes

Flavio Martinelli, Alexander Van Meegen, Berfin Şimşek, Wulfram Gerstner, Johanni Brea

TL;DR

This paper uncovers channels to infinity in neural loss landscapes: directions along which gradient flow makes extremely slow progress while the readout weights of two neurons diverge and their input weights coalesce. These channels run parallel to the symmetry-induced saddle lines created by neuron duplication. In the limit of diverging readouts and converging input weights, the pair of neurons implements a gated linear unit, which gives these quasi-flat regions a functional interpretation. The authors develop a reparameterization and an epsilon-expansion analysis showing convergence to gated linear units, and they illustrate stability properties through both theory and toy experiments. They demonstrate that gradient-based optimizers routinely approach these channels from random initializations and that channels can host minima at infinity with distinct computational capabilities. These insights offer a new lens on non-convex optimization in deep networks and suggest practical implications for generalization, model fusion, and training dynamics in large-scale architectures, including potential extensions to multi-neuron channels and deeper networks.

Abstract

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w}_i$ and $\mathbf{w}_j$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\sigma(\mathbf{w}_j \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
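
The limit stated in the abstract can be checked numerically. The sketch below is only an illustration under stated assumptions, not the authors' reparameterization: it assumes a symmetric split $\mathbf{w}_{i,j} = \mathbf{w} \pm \epsilon\boldsymbol{\Delta}$ and $a_{i,j} = a/2 \pm c/\epsilon$ (the names `delta`, `a`, `c`, and `eps` are hypothetical), uses softplus as $\sigma$, and compares the two-neuron output with $a\,\sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x})\,\sigma'(\mathbf{w} \cdot \mathbf{x})$ for $\mathbf{v} = 2c\boldsymbol{\Delta}$.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def sigmoid(z):
    # Derivative of softplus; it acts as the gate in the limit.
    return 1.0 / (1.0 + np.exp(-z))

def two_neuron_output(x, w, delta, a, c, eps):
    # Assumed parameterization (illustrative, not taken from the paper):
    # input weights approach w while output weights diverge like c / eps.
    w_i, w_j = w + eps * delta, w - eps * delta
    a_i, a_j = a / 2 + c / eps, a / 2 - c / eps
    return a_i * softplus(x @ w_i) + a_j * softplus(x @ w_j)

def glu_limit(x, w, delta, a, c):
    # One plain neuron plus a linear unit gated by sigma'(w . x).
    v = 2.0 * c * delta
    return a * softplus(x @ w) + (x @ v) * sigmoid(x @ w)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))   # random inputs
w = rng.normal(size=4)          # shared input weight direction
delta = rng.normal(size=4)      # direction along which the pair splits
a, c = 1.0, 0.7

errors = {}
for eps in (1e-1, 1e-2, 1e-3):
    diff = two_neuron_output(x, w, delta, a, c, eps) - glu_limit(x, w, delta, a, c)
    errors[eps] = float(np.max(np.abs(diff)))
    print(f"eps={eps:g}  max abs error={errors[eps]:.3e}")
```

Under this symmetric split, a Taylor expansion in $\epsilon$ cancels the odd-order terms, so the printed error should shrink roughly a hundredfold for each tenfold decrease in $\epsilon$.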

Paper Structure

This paper contains 27 sections, 41 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Saddle lines à la Fukumizu & Amari \cite{fukumizu2000local} and channels to infinity. Left: Duplicating a neuron in a network trained to convergence generates lines of saddle points in the loss landscape \cite{fukumizu2000local}. Duplicated neurons share the input weights of the original neuron while their output weights $\gamma a, (1-\gamma)a$ sum to the original neuron's output weight $a$. Middle: Loss landscape of the duplicated network projected along the saddle line (in red) and the eigenvector of the smallest (most negative) eigenvalue of the loss Hessian. Parallel to the saddle line there are channels to infinity (green curve) along which the loss decreases very slowly. Following the channel, the output weights diverge to infinite norm and the input weights converge to a new value. Right: The solution at infinity implements a new function consisting of a single neuron and a gated linear unit. The gating function is the derivative of the original activation function $\sigma$.
  • Figure 2: Stable plateau-saddles and their loss landscape in MLPs without bias. (a) Networks of 1 to 5 hidden neurons and scalar output are trained on the shown 2D regression target (logarithm of the Rosenbrock function, see \ref{app:duplication}). Training follows full-batch gradient flow dynamics until convergence to a critical point. A quantification of unique solutions in weight-space (up to permutation symmetries) is shown at the bottom. (b) Loss levels of converged networks: each diamond shows the loss of a converged network, and the color code indicates network size. The only source of randomness is the initialization. Many identical-loss solutions are found by networks of different sizes. Inset: Frequency of converged solutions exhibiting duplicated neurons. (c) Loss landscape along the duplication parameter $\gamma$ and the direction of smallest eigenvalue of the Hessian $\alpha \boldsymbol{e}_{\mathrm{min}}(\gamma)$ corresponding to one of the converged solutions shown in (b). Small perturbations are stable only within the plateau-saddle region, $\gamma \in (0,1)$. (d) Gradient-flow trajectories following a small perturbation from the saddle line in the direction of $\alpha \boldsymbol{e}_{\mathrm{min}}$ for the example shown in (c). Perturbations outside the plateau-saddle region, $\gamma \notin (0,1)$, escape the saddle line and land in other minima.
  • Figure 3: Channels to infinity. (a) Loss landscape of a 4-4-1 MLP trained on a regression task (\ref{app:channels}). The saddle line (red straight line) is found via neuron splitting from a local minimum of a 4-3-1 MLP. The surface is a slice of the loss along the splitting parameter $\gamma$ and the direction of smallest (negative) eigenvalue of the Hessian $\alpha \boldsymbol{e}_\mathrm{min}$. Most other eigenvalues are positive. At first glance, it looks as if there were two channels to infinity parallel to the saddle line, but the analysis in the next panels reveals that there is only one (the green curved line). (b) Loss profile along $\alpha\boldsymbol{e}_\mathrm{min}$, color-coded at different values of $\gamma$. Note that the loss is not continually decreasing for positive $\alpha$, indicating that this is not a channel to infinity. (c) A top-view of the landscape for large $\gamma$ reveals that the local picture of the loss landscape in (a) holds also for very large $\gamma$. (d) The two-dimensional projection of the loss landscape in panels (a)-(c) does not show how the loss depends on all the other free parameters of the 4-4-1 MLP. Therefore, we look at gradient-flow trajectories following a small perturbation along $\boldsymbol{e}_\mathrm{min}$ from the saddle line at $\gamma=1.5$. The perturbation direction is shown as green and orange arrows on the surface plot in panel (a). After the green perturbation ($\alpha_-$), the gradient trajectory moves inside a channel to infinity towards increasing values of $\gamma$ following a descent with extremely small slope (green channel to infinity). The orange perturbation ($\alpha_+$) converges to a finite-norm minimum, which confirms that the landscape at positive $\alpha$ is not a channel to infinity. (e) Cosine distance between parameter updates $\Delta \boldsymbol{\theta}$ and direction of the saddle line ($\Gamma$) for the $\alpha_-$ perturbation: after an initial high-dimensional trajectory, parameter updates $\Delta\boldsymbol{\theta}$ are parallel to the saddle line. The ODE dynamics reveal an extremely slow divergence of $\gamma\rightarrow\infty$.
  • Figure 4: Frequency and properties of channels to infinity. (a) As a first criterion to identify channels to infinity, we consider the cosine distance of the pair of closest input weight vectors within a network and the sum of absolute output weights corresponding to that pair. Putative channel solutions are identified by having a large weight norm and a small distance in input weights (top left section of the graph). (b) As a second criterion, we count as channel solutions those networks whose parameter updates lie mostly within the subspace spanned by the putative channels, parallel to the saddle line. After filtering for networks that satisfy (a), we compute the percentage of updates that lie inside the channel subspace. At late stages of training, most of the network updates are parallel to the saddle line (cf. \ref{fig4}e). (c) Estimated probability of converging to a channel for various datasets and architectures spanning different input and hidden dimensions and number of hidden layers (see \ref{app:channels} for details). (d) Distribution of the training loss for finite-norm minima and channel minima. There is no evident difference between the two types of minima (see \ref{app:channels}). (e) Trajectories of maximum and minimum Hessian eigenvalues for finite-norm and channel minima. Channel solutions have larger maximum eigenvalues, indicating that they are sharper than the finite-norm minima. Channels sharpen as training progresses. They are extremely flat regions, as indicated by the small magnitude of the minimum eigenvalue. (f) For MLPs with three hidden layers, channels appear in all layers, and in multiple layers at the same time. (g) Channels do not always involve pairs of neurons; they can be formed by an arbitrary number of neurons with diverging output weights and converging input weights. In these cases, the flat regions are multi-dimensional (see \ref{app:channels} for details on multi-dimensional channels). Panels (b)-(e) show results for the GP (s=0.5) dataset; see \ref{app:channels} for other datasets.
  • Figure 5: Convergence in $\epsilon$ to gated linear units. Moving along a channel to infinity with the jump procedure described in \ref{app:theory} shows that $c$, $a$, and the cosine similarity $\cos(\boldsymbol{\Delta}, \boldsymbol{w})$ converge to constant values, that the loss and the approximation error decrease with $\epsilon^2$, and that the sharpness diverges with $1/\epsilon^2$, as predicted by the theory. This example uses a network with 8 input dimensions and 8 hidden softplus neurons (81 parameters) trained on the Rosenbrock target function.
  • ...and 14 more figures
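
The first identification criterion in the Figure 4(a) caption (a small cosine distance between the closest pair of input weight vectors, together with a large sum of absolute output weights for that pair) could be sketched as follows. This is a minimal illustration, not the authors' code: the function names and both thresholds (`max_cos_dist`, `min_abs_out`) are hypothetical, and a single hidden layer with input weights `W` of shape (hidden, input) and output weights `a` of shape (hidden,) is assumed.

```python
import numpy as np

def closest_pair_stats(W, a):
    """Return the closest pair of hidden neurons, their cosine distance,
    and the sum of the absolute output weights of that pair."""
    # Normalize rows; assumes no input weight vector is exactly zero.
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T
    np.fill_diagonal(cos_dist, np.inf)  # ignore self-pairs
    i, j = np.unravel_index(np.argmin(cos_dist), cos_dist.shape)
    return (int(i), int(j)), float(cos_dist[i, j]), float(abs(a[i]) + abs(a[j]))

def looks_like_channel(W, a, max_cos_dist=1e-3, min_abs_out=50.0):
    # Hypothetical thresholds, chosen only for illustration.
    _, dist, abs_out = closest_pair_stats(W, a)
    return dist < max_cos_dist and abs_out > min_abs_out

rng = np.random.default_rng(1)

# A generic random network: no nearly parallel pair with large readouts.
W_generic = rng.normal(size=(5, 3))
a_generic = rng.normal(size=5)

# A synthetic channel-like network: neurons 0 and 1 have almost equal
# input weights and large output weights of opposite sign.
W_channel = W_generic.copy()
W_channel[1] = W_channel[0] + 1e-3 * rng.normal(size=3)
a_channel = a_generic.copy()
a_channel[0], a_channel[1] = 400.0, -399.0

pair, dist, abs_out = closest_pair_stats(W_channel, a_channel)
generic_flag = looks_like_channel(W_generic, a_generic)
channel_flag = looks_like_channel(W_channel, a_channel)
print(pair, dist, abs_out, generic_flag, channel_flag)
```

Such a check only flags putative channels. The caption's second criterion, that late-stage parameter updates lie mostly in the channel subspace, is not covered by this sketch.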