Optimization Insights into Deep Diagonal Linear Networks

Hippolyte Labarrière, Cesare Molinari, Lorenzo Rosasco, Cristian Vega, Silvia Villa

TL;DR

The paper asks why gradient-based optimization succeeds in training highly overparameterized, nonconvex systems, and studies the question on Deep Diagonal Linear Networks (DDLN) with effective parameter $\theta = \bigodot_{j=1}^L u^j$. It shows that gradient flow on the layer weights induces a mirror-flow dynamic in $\theta$, and proves convergence guarantees, including exponential decay of the loss under a Polyak-Łojasiewicz condition, with rates governed by the initialization scale. A key finding is that the reparameterization yields favorable geometry: a convex entropy $\mathcal{Q}$ drives $\theta$ along a mirror flow, and under certain initializations training converges quickly, with deeper networks offering acceleration in some regimes. The work clarifies how parametrization and initialization shape optimization in overparameterized settings, giving theoretical insight into the implicit bias and guidance on choosing initializations for more efficient training.

Abstract

Gradient-based methods successfully train highly overparameterized models in practice, even though the associated optimization problems are markedly nonconvex. Understanding the mechanisms that make such methods effective has become a central problem in modern optimization. To investigate this question in a tractable setting, we study Deep Diagonal Linear Networks. These are multilayer architectures with a reparameterization that preserves convexity in the effective parameter, while inducing a nontrivial geometry in the optimization landscape. Under mild initialization conditions, we show that gradient flow on the layer parameters induces a mirror-flow dynamic in the effective parameter space. This structural insight yields explicit convergence guarantees, including exponential decay of the loss under a Polyak-Łojasiewicz condition, and clarifies how the parametrization and initialization scale govern the training speed. Overall, our results demonstrate that deep diagonal overparameterizations, despite their apparent complexity, can endow standard gradient methods with well-behaved and interpretable optimization dynamics.
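
The step from layer-wise gradient flow to a dynamic in the effective parameter can be sketched with the chain rule alone. The computation below assumes the gradient flow takes the standard form $\dot u^j = -\nabla_{u^j}\,\mathcal{L}\big(\bigodot_{k=1}^L u^k\big)$ for each layer $j\in[L]$; the paper's own equation and the precise form of its entropy are not reproduced in this summary, so this is an illustration rather than the paper's statement. Squares and products are taken component-wise.

$$
\dot u^j(t) = -\Big(\bigodot_{k\neq j} u^k(t)\Big)\odot\nabla\mathcal{L}(\theta(t)), \qquad j\in[L],
$$

$$
\dot\theta(t) = \sum_{j=1}^{L}\Big(\bigodot_{k\neq j} u^k(t)\Big)\odot\dot u^j(t)
= -\Big(\sum_{j=1}^{L}\,\bigodot_{k\neq j}\big(u^k(t)\big)^{2}\Big)\odot\nabla\mathcal{L}(\theta(t)),
$$

$$
\frac{d}{dt}\Big[\big(u^j\big)^{2}-\big(u^k\big)^{2}\Big]
= -2\,\theta\odot\nabla\mathcal{L}(\theta) + 2\,\theta\odot\nabla\mathcal{L}(\theta) = 0 .
$$

Under this assumption, $\theta$ follows a gradient flow rescaled component-wise by a nonnegative, time-dependent factor, and the differences of squared layers are conserved along the trajectory, so that factor is tied to the initialization.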

Paper Structure

This paper contains 23 sections, 4 theorems, 66 equations, 4 figures.

Key Result

Proposition 1

Let $\left(u^j\right)_{j\in[L]}$ satisfy the gradient flow equation (eq:GF_Llay) and let $\theta=\bigodot_{j=1}^Lu^j$. Then the following statements hold:

Figures (4)

  • Figure 1: Representation of a Diagonal Linear Network
  • Figure 2: Representation of a Deep Diagonal Linear Network with $L$ layers in $\mathbb{R}^d$, $d=5$.
  • Figure 3: Dynamics of the nodes $(u^j_1(t))_{j\in[L]}$ for the loss function $\mathcal{L}:\theta\mapsto\|X\theta-y\|^2$ with $X\in\mathbb{R}^{10\times 5}$ and $y\in\mathbb{R}^{10}$ generated randomly, and $L=4$ layers.
  • Figure 4: Evolution of $\log\left(\mathcal{L}(\theta(t))-\mathcal{L}^*\right)$ over time for three $6$-layer networks with different initializations. The loss function is $\mathcal{L}:\theta\mapsto\|X\theta-y\|^2$ with $X\in\mathbb{R}^{10\times 8}$ and $y\in\mathbb{R}^{10}$ generated randomly. Each network is initialized with all components of its first layer equal to $0$. The initial values of the remaining layers of the first network (in blue) are generated randomly, while those of the second (in orange) and the third (in green) are, component-wise, $1.4$ and $1.8$ times the values of the first.
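
The setup described in the Figure 4 caption can be approximated with a short simulation. The sketch below is not the authors' code: it replaces the continuous gradient flow with explicit Euler steps, and the function name `simulate_ddln`, the random seed, the sampling range of the initial layers, the step size, and the time horizon are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_ddln(X, y, layers0, step=5e-5, n_steps=20_000):
    """Explicit-Euler discretization of gradient flow on the layers of a
    deep diagonal linear network with loss ||X theta - y||^2.

    layers0 has shape (L, d); theta is the component-wise product of its rows.
    Returns the final layers and the loss at every iterate.
    """
    U = np.array(layers0, dtype=float)
    L = U.shape[0]
    idx = np.arange(L)
    losses = np.empty(n_steps + 1)
    for t in range(n_steps):
        theta = U.prod(axis=0)
        residual = X @ theta - y
        losses[t] = residual @ residual
        grad_theta = 2.0 * X.T @ residual
        # partial[j] = component-wise product of all layers except layer j
        tiled = np.broadcast_to(U, (L,) + U.shape).copy()
        tiled[idx, idx] = 1.0
        partial = tiled.prod(axis=1)
        # chain rule: gradient w.r.t. layer j is partial[j] * grad_theta
        U -= step * partial * grad_theta
    residual = X @ U.prod(axis=0) - y
    losses[-1] = residual @ residual
    return U, losses


rng = np.random.default_rng(0)  # assumed seed
n, d, depth = 10, 8, 6          # sizes taken from the Figure 4 caption
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
loss_star = np.sum((X @ theta_ls - y) ** 2)  # minimal value of the loss

# Assumed initialization: positive random layers, first layer set to zero.
base = rng.uniform(0.5, 0.6, size=(depth, d))
base[0] = 0.0

results = {}
for scale in (1.0, 1.4, 1.8):
    _, losses = simulate_ddln(X, y, scale * base)
    results[scale] = losses
    print(f"scale={scale}: final gap L - L* = {losses[-1] - loss_star:.3e}")
```

Plotting `np.log(results[s] - loss_star)` against `step * np.arange(len(results[s]))` for each scale `s` gives curves comparable in spirit to the figure. The step size is kept small because the effective stiffness of the problem grows quickly with the initialization scale in a 6-layer network, so a larger step may diverge for the largest scale.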

Theorems & Definitions (4)

  • Proposition 1
  • Theorem 1
  • Theorem 2
  • Corollary 1