Optimization Insights into Deep Diagonal Linear Networks
Hippolyte Labarrière, Cesare Molinari, Lorenzo Rosasco, Cristian Vega, Silvia Villa
TL;DR
The paper asks why gradient-based optimization effectively trains heavily overparameterized, nonconvex systems, using Deep Diagonal Linear Networks (DDLN) with effective parameter $\theta = \bigodot_{j=1}^L u^j$ as a tractable model. It shows that gradient flow on the layer weights induces a mirror-flow dynamic in $\theta$ and proves convergence guarantees, including exponential decay of the loss under a Polyak-Łojasiewicz condition, with rates governed by the initialization scale. A key finding is that the reparameterization yields favorable geometry: a convex entropy $\mathcal{Q}$ drives $\theta$ along a mirror flow, and under certain initializations training converges fast, with deeper networks offering acceleration in some regimes. The work clarifies how parametrization and initialization shape optimization in overparameterized settings, providing theoretical insight into the implicit bias and guiding initialization choices for more efficient training.
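A short chain-rule computation suggests where the mirror-flow structure comes from. This is only a sketch: the loss symbol $\mathcal{L}$, the time variable $t$, and the assumption that $\mathcal{L}$ is a differentiable function of $\theta$ alone are notation introduced here, not taken from the paper. Gradient flow on each layer reads

$$
\dot u^j(t) = -\nabla_{u^j}\,\mathcal{L}\big(\theta(t)\big) = -\Big(\bigodot_{k \neq j} u^k(t)\Big) \odot \nabla \mathcal{L}\big(\theta(t)\big), \qquad j = 1, \dots, L,
$$

and differentiating $\theta = \bigodot_{j=1}^L u^j$ with the product rule gives

$$
\dot\theta(t) = \sum_{j=1}^{L} \Big(\bigodot_{k \neq j} u^k(t)\Big) \odot \dot u^j(t) = -\Big(\sum_{j=1}^{L} \bigodot_{k \neq j} \big(u^k(t)\big)^{2}\Big) \odot \nabla \mathcal{L}\big(\theta(t)\big),
$$

where the square is taken entrywise. The effective parameter therefore follows a gradient flow preconditioned by a nonnegative diagonal matrix. A mirror flow with entropy $\mathcal{Q}$ has the form $\dot\theta = -\big[\nabla^2 \mathcal{Q}(\theta)\big]^{-1} \nabla \mathcal{L}(\theta)$, so the two agree whenever this diagonal preconditioner can be written as a function of $\theta$ alone; this is presumably where the conditions on the initialization enter.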
Abstract
Gradient-based methods successfully train highly overparameterized models in practice, even though the associated optimization problems are markedly nonconvex. Understanding the mechanisms that make such methods effective has become a central problem in modern optimization. To investigate this question in a tractable setting, we study Deep Diagonal Linear Networks. These are multilayer architectures with a reparameterization that preserves convexity in the effective parameter, while inducing a nontrivial geometry in the optimization landscape. Under mild initialization conditions, we show that gradient flow on the layer parameters induces a mirror-flow dynamic in the effective parameter space. This structural insight yields explicit convergence guarantees, including exponential decay of the loss under a Polyak-Łojasiewicz condition, and clarifies how the parametrization and initialization scale govern the training speed. Overall, our results demonstrate that deep diagonal overparameterizations, despite their apparent complexity, can endow standard gradient methods with well-behaved and interpretable optimization dynamics.
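The setting can be illustrated with a small numerical experiment. The following is a minimal sketch, not the authors' code: it assumes a squared loss on a toy two-sample problem, replaces gradient flow with explicit Euler steps (plain gradient descent), and initializes every layer entry to the same value `init_scale`. All function names, the data, the step size, and the tolerance are hypothetical choices made for illustration.

```python
def hadamard(a, b):
    """Entrywise product of two vectors stored as lists."""
    return [x * y for x, y in zip(a, b)]


def effective_parameter(layers, dim):
    """theta = u^1 * u^2 * ... * u^L (entrywise); all-ones vector if no layers."""
    theta = [1.0] * dim
    for u in layers:
        theta = hadamard(theta, u)
    return theta


def loss_and_grad(theta, X, y):
    """Assumed loss: (1 / 2n) * sum_i (<x_i, theta> - y_i)^2, with its gradient in theta."""
    n = len(X)
    residuals = [sum(x * t for x, t in zip(row, theta)) - target
                 for row, target in zip(X, y)]
    loss = 0.5 * sum(r * r for r in residuals) / n
    grad = [sum(X[i][k] * residuals[i] for i in range(n)) / n
            for k in range(len(theta))]
    return loss, grad


def train(X, y, depth, init_scale, step=0.05, iters=2000):
    """Gradient descent on the layers u^j (Euler discretization of gradient flow)."""
    dim = len(X[0])
    layers = [[init_scale] * dim for _ in range(depth)]
    losses = []
    for _ in range(iters):
        theta = effective_parameter(layers, dim)
        loss, grad = loss_and_grad(theta, X, y)
        losses.append(loss)
        updated = []
        for j in range(depth):
            # Chain rule: d(loss)/d(u^j) = (product of the other layers) * grad_theta
            others = effective_parameter(layers[:j] + layers[j + 1:], dim)
            updated.append([u - step * o * g
                            for u, o, g in zip(layers[j], others, grad)])
        layers = updated
    return effective_parameter(layers, dim), losses


def iterations_to_tolerance(losses, tol=1e-8):
    """Index of the first iterate whose loss is below tol, or None if never reached."""
    for k, value in enumerate(losses):
        if value < tol:
            return k
    return None


# Toy overparameterized problem: 2 samples, 3 coordinates, third one unobserved.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y = [1.0, 2.0]

theta_large, losses_large = train(X, y, depth=3, init_scale=0.5)
theta_small, losses_small = train(X, y, depth=3, init_scale=0.1)

print("init 0.5:", theta_large, iterations_to_tolerance(losses_large))
print("init 0.1:", theta_small, iterations_to_tolerance(losses_small))
```

In this toy run both initializations fit the data, but the smaller one needs more iterations to reach the tolerance, which matches the qualitative claim that the initialization scale governs training speed. The unobserved third coordinate receives zero gradient and stays at its initial product, a small illustration of how the initialization selects which interpolating solution is reached.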
