On the Convergence of Overparameterized Problems: Inherent Properties of the Compositional Structure of Neural Networks

Arthur Castello Branco de Oliveira, Dhruv Jatkar, Eduardo Sontag

TL;DR

The paper analyzes how the compositional structure of overparameterized neural networks shapes optimization by treating training as gradient flow on a deep linear network. It proves that for any proper real-analytic cost $f$, invariant quantities $\mathscr{C}$ induce a foliation of the parameter space into invariant manifolds, ensuring convergence to critical points of the overparameterized objective $g$; for scalar costs, the center-stable geometry is universal and the convergence rate depends on initialization imbalance. In the vector (one-hidden-layer) case, trajectories converge to the set where $f'(w_2w_1)=0$, with convergence guaranteed outside a measure-zero set of initializations, and accelerated convergence is shown to follow from the imbalance measure $c$. A proof-of-concept extension to sigmoidal activations indicates that the qualitative geometric structure persists beyond linear activations. These results advance understanding of how network composition governs training dynamics and suggest principled avenues for faster optimization and generalization in overparameterized models.

Abstract

This paper investigates how the compositional structure of neural networks shapes their optimization landscape and training dynamics. We analyze the gradient flow associated with overparameterized optimization problems, which can be interpreted as training a neural network with linear activations. Remarkably, we show that the global convergence properties can be derived for any cost function that is proper and real analytic. We then specialize the analysis to scalar-valued cost functions, where the geometry of the landscape can be fully characterized. In this setting, we demonstrate that key structural features -- such as the location and stability of saddle points -- are universal across all admissible costs, depending solely on the overparameterized representation rather than on problem-specific details. Moreover, we show that convergence can be arbitrarily accelerated depending on the initialization, as measured by an imbalance metric introduced in this work. Finally, we discuss how these insights may generalize to neural networks with sigmoidal activations, showing through a simple example which geometric and dynamical properties persist beyond the linear case.
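The acceleration claim can be made concrete in the simplest setting. The following is a sketch under stated assumptions, not the paper's derivation: a scalar two-layer factorization $g(w_1, w_2) = f(w_2 w_1)$, with the product $p = w_2 w_1$ and the imbalance taken to be $c = w_1^2 - w_2^2$ (the standard conserved quantity for two-layer linear networks; the paper's own definition is not reproduced in this summary).

$$
\begin{aligned}
\dot w_1 &= -f'(p)\, w_2, \qquad \dot w_2 = -f'(p)\, w_1, \\
\dot c &= 2 w_1 \dot w_1 - 2 w_2 \dot w_2 = -2 f'(p)\, w_1 w_2 + 2 f'(p)\, w_2 w_1 = 0, \\
\dot p &= \dot w_2 w_1 + w_2 \dot w_1 = -f'(p)\,\bigl(w_1^2 + w_2^2\bigr) = -f'(p)\,\sqrt{c^2 + 4p^2}.
\end{aligned}
$$

The last equality uses $(w_1^2 + w_2^2)^2 = (w_1^2 - w_2^2)^2 + 4 w_1^2 w_2^2$. Under these assumptions the product $p$ follows a gradient flow on $f$ rescaled by the factor $\sqrt{c^2 + 4p^2} \ge |c|$, so a larger initial imbalance gives a proportionally faster flow toward the set where $f'(p) = 0$.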

Paper Structure

This paper contains 15 sections, 14 theorems, 65 equations, 3 figures.

Key Result

proposition 1

The value of $\mathcal{C}$ is constant along any solution of the overparameterized gradient flow equation (eq:gradflowOVP-def).
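A quick numerical check can illustrate this kind of invariance. The sketch below is only illustrative and makes several assumptions that do not come from the paper: a scalar two-layer problem $g(w_1, w_2) = f(w_2 w_1)$, a quadratic cost $f(p) = \tfrac{1}{2}(p - 1)^2$, a forward-Euler discretization of the gradient flow, and the imbalance $w_1^2 - w_2^2$ standing in for the invariant. The names `gradient_flow` and `imbalance` are hypothetical helpers.

```python
import math


def gradient_flow(w1, w2, target=1.0, dt=1e-3, steps=20000, tol=1e-6):
    """Forward-Euler approximation of gradient flow on g(w1, w2) = f(w2 * w1),
    with the illustrative cost f(p) = 0.5 * (p - target)**2 (an assumption).

    Returns the final (w1, w2) and the first step index at which
    |w2 * w1 - target| < tol (None if the tolerance is never reached).
    """
    hit = None
    for k in range(steps):
        residual = w2 * w1 - target  # f'(p) for the quadratic cost
        if hit is None and abs(residual) < tol:
            hit = k
        # Simultaneous update: dw1/dt = -f'(p) * w2, dw2/dt = -f'(p) * w1.
        w1, w2 = w1 - dt * residual * w2, w2 - dt * residual * w1
    return w1, w2, hit


def imbalance(w1, w2):
    """Assumed invariant for the two-layer scalar case: w1^2 - w2^2."""
    return w1 ** 2 - w2 ** 2


# Two initializations with the same product w2 * w1 = 0.3 but different imbalance.
initial_points = {
    "balanced": (math.sqrt(0.3), math.sqrt(0.3)),  # imbalance 0
    "imbalanced": (1.5, 0.2),                      # imbalance 2.21
}

for name, (a, b) in initial_points.items():
    w1, w2, hit = gradient_flow(a, b)
    drift = abs(imbalance(w1, w2) - imbalance(a, b))
    print(f"{name}: product={w1 * w2:.6f}, imbalance drift={drift:.2e}, "
          f"steps to tolerance={hit}")
```

Because forward Euler only approximates the continuous flow, the imbalance is preserved up to a small discretization error rather than exactly. In this toy setting the imbalanced initialization should reach the tolerance in fewer steps than the balanced one, in line with the initialization-dependent acceleration described in the abstract.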

Figures (3)

  • Figure 1: Depiction of a linear neural network.
  • Figure 2: Illustration of the foliation of the state space of overparameterized optimization problems. The black curve illustrates a branch of the set of critical points given by $\mathbf{W}(W_1,\dots, W_N) = W^*$, and the sets $\mathscr{C}_i$ are the invariant manifolds of the dynamics. The manifolds do not intersect one another, yet every point in the parameter space lies in one of them, resulting in a "foliated" optimization landscape. Global properties of the training can then be established by showing that the relevant local property holds within each manifold, for all manifolds.
  • Figure 3: Illustration of the effect of sigmoidal activations on the optimization landscape of neural network training for factorization problems. The left panel displays the optimization landscape for the scalar factorization problem trained with linear neural networks, while the right panel shows the same problem with sigmoidal networks. The target sets (global optima) for each problem are shown in solid black, the center-stable manifolds of the saddle at the origin in red, and its unstable manifold in blue. The qualitative description of the parameter space remains unchanged: a measure-zero center-stable manifold for the saddle and almost-everywhere convergence to the target. However, the center-stable and unstable manifolds in the left panel divide the parameter space into four "equivalent" invariant regions, while in the right panel they divide it into four regions of two different "types".

Theorems & Definitions (24)

  • definition 1
  • proposition 1
  • theorem 1
  • theorem 2
  • proposition 2
  • corollary 1
  • proposition 3
  • proposition 4
  • lemma 1
  • proof
  • ...and 14 more