Exploring the loss landscape of regularized neural networks via convex duality

Sungyoon Kim, Aaron Mishkin, Mert Pilanci

TL;DR

The paper develops a duality-based framework to analyze the loss landscape of regularized ReLU networks by casting the training problem into a convex cone form and studying its dual. It shows that, for sufficiently wide two-layer networks, the problem is equivalent to a cone-constrained group LASSO, with the dual optimum $\nu^{*}$ determining a fixed set of optimal directions and yielding a polyhedral description of the optimal set. A staircase of connectivity is established as the network width crosses critical thresholds, and nonunique minimum-norm interpolators are constructed, highlighting the role of regularization and architectural choices. The approach generalizes to vector-valued and parallel deep architectures, preserving finite sets of weight directions and extending the connectivity results. The findings illuminate how regularization shapes the loss landscape and offer tools for understanding optimization dynamics in practice.

Abstract

We discuss several aspects of the loss landscape of regularized neural networks: the structure of stationary points, the connectivity of optimal solutions, the existence of a path with nonincreasing loss to an arbitrary global optimum, and the nonuniqueness of optimal solutions, by casting the problem into an equivalent convex problem and considering its dual. Starting from two-layer neural networks with scalar output, we first characterize the solution set of the convex problem using its dual and further characterize all stationary points. With the characterization, we show that the topology of the global optima goes through a phase transition as the width of the network changes, and construct counterexamples where the problem may have a continuum of optimal solutions. Finally, we show that the solution set characterization and connectivity results can be extended to different architectures, including two-layer vector-valued neural networks and parallel three-layer neural networks.
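
To make the "equivalent convex problem" concrete, the following is a minimal sketch of a cone-constrained group LASSO objective for a two-layer ReLU network. It is not taken from the paper: the squared loss, the names `sample_activation_patterns`, `in_cone`, and `group_lasso_objective`, and the variables `V`, `W`, `beta` are all illustrative assumptions. The exact convex program ranges over every activation pattern $D_i = \mathrm{diag}(\mathbb{1}[Xu \geq 0])$; the sketch swaps in random sampling of patterns as a simple approximation.

```python
import numpy as np


def sample_activation_patterns(X, num_samples=200, seed=0):
    """Collect distinct ReLU activation patterns 1[X u >= 0] from random u.

    Assumption: random sampling stands in for full enumeration of patterns.
    Returns an array of shape (P, n); row i is the diagonal of D_i.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[1], num_samples))
    return np.unique((X @ U >= 0).astype(float).T, axis=0)


def in_cone(X, d, u, tol=1e-9):
    """Check the cone constraint (2 D - I) X u >= 0 for the pattern d = diag(D)."""
    return bool(np.all((2.0 * d - 1.0) * (X @ u) >= -tol))


def group_lasso_objective(X, y, patterns, V, W, beta):
    """Squared loss of sum_i D_i X (v_i - w_i) plus a group-norm penalty.

    V and W have shape (P, d); row i holds v_i (resp. w_i) for pattern D_i.
    The squared loss is an illustrative choice of convex loss.
    """
    pred = np.sum(patterns.T * (X @ (V - W).T), axis=1)
    loss = 0.5 * np.sum((pred - y) ** 2)
    penalty = beta * (np.linalg.norm(V, axis=1).sum() + np.linalg.norm(W, axis=1).sum())
    return loss + penalty


if __name__ == "__main__":
    # Hypothetical toy data, used only to exercise the functions.
    X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    y = np.array([1.0, 2.0, 0.0])
    patterns = sample_activation_patterns(X)
    zeros = np.zeros((patterns.shape[0], X.shape[1]))
    print(patterns.shape, group_lasso_objective(X, y, patterns, zeros, zeros, beta=0.1))
```

The group norm on each $v_i$ and $w_i$ is what makes the problem a group LASSO, and the constraint checked by `in_cone` is what makes it cone-constrained: each block must reproduce the activation pattern it is paired with.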

Paper Structure

This paper contains 20 sections, 47 theorems, 257 equations, 9 figures.

Key Result

Proposition 1

[mishkin2023optimal] Let $\Theta^{*}$ denote the optimal solution set of the convex problem (eq:convex_twolayer_opt). If the loss function $L$ is strictly convex, the optimal model fit is unique, i.e. the set of optimal model fit ...

Figures (9)

  • Figure 1: A schematic that illustrates the staircase of connectivity. This conceptual figure gives a high-level description of how the topology of the solution set changes as the number of neurons $m$ changes. Connected components that are not singletons are shown as blue sets, whereas singletons are depicted as red dots. When $m = m^{*}$, there are only finitely many red dots. When $m \geq m^{*}+1$, there exists a connected component that is not a singleton, i.e. a blue set. When $m = M^{*}$, there exists a connected component which is a singleton, i.e. a red dot. When $m \geq M^{*}+1$, there is no red dot. Finally, when $m \geq \min\{m^{*}+M^{*}, n+1\}$, there is a single blue set.
  • Figure 2: Staircase of connectivity for a toy example. The figures above the horizontal line show the toy problem's loss landscape as the width $m$ changes. The red star denotes a single optimal solution while the blue line denotes a continuum of optimal solutions. The figures below the horizontal line show the corresponding optimal functions. The red/blue functions correspond to the functions parametrized by the red/blue sets in the loss landscape. Note that when $m = 3 = \min\{m^{*}+M^{*}, n+1\}$, there exists a continuous deformation from one solution to another.
  • Figure 3: A demonstration of non-unique interpolators for $n = 5$. \ref{fig3-1:construction} shows the geometric construction behind finding the $v$'s proposed in \ref{p2:Class_Nonunique}. \ref{fig3-2:OptimalInterpolators_zoom} shows the continuum of optimal interpolators, and \ref{fig3-3:Learned_GD_interp} shows the learned interpolators trained by gradient descent.
  • Figure 4: A contour plot of the loss landscape. The three figures show the contour plot of the loss landscape shown in \ref{fig2:example}. We can see the staircase of connectivity more clearly.
  • Figure 5: Learned functions found by gradient descent. The two figures show what functions gradient descent learns for the toy problem in \ref{example:staircasecon}. In both cases, $m = 3$ and $m = 5$, gradient descent either gets stuck at a local minimum or finds one of the optimal networks in the continuum of optimal solutions.
  • ...and 4 more figures
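
The width thresholds stated in the Figure 1 caption can be collected into a small lookup. This is only a sketch that restates the caption: the function name `staircase_properties` and the wording of the returned strings are hypothetical, and the caption makes no statement for widths below $m^{*}$, so the function returns an empty list there.

```python
def staircase_properties(m, m_star, M_star, n):
    """List the properties of the optimal set that the Figure 1 caption states for width m.

    m_star and M_star are the critical widths m^* and M^* from the caption,
    and n is the quantity appearing in min{m^* + M^*, n + 1}.
    """
    props = []
    if m == m_star:
        props.append("only finitely many singleton components")
    if m >= m_star + 1:
        props.append("a non-singleton connected component exists")
    if m == M_star:
        props.append("a singleton connected component exists")
    if m >= M_star + 1:
        props.append("no singleton components")
    if m >= min(m_star + M_star, n + 1):
        props.append("a single connected component")
    return props


if __name__ == "__main__":
    # Hypothetical threshold values chosen only for illustration.
    for width in range(2, 8):
        print(width, staircase_properties(width, m_star=2, M_star=4, n=10))
```

Several properties can hold at once, for example a width that is both above $m^{*}$ and equal to $M^{*}$, which is why the function returns a list rather than a single label.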

Theorems & Definitions (97)

  • Proposition 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Corollary 3
  • Example 1
  • Proposition 2
  • Proposition 3
  • Example 2
  • ...and 87 more