Diagonal Linear Networks and the Lasso Regularization Path

Raphaël Berthier

TL;DR

This work establishes a dynamical link between the training of diagonal linear networks (DLNs) and the classical lasso path. By analyzing gradient flow in the small-initialization regime, the author shows that the time-rescaled, averaged DLN trajectory tracks the lasso regularization path, with the rescaled time playing the role of the inverse regularization parameter. The connection is exact under a monotonicity assumption on the lasso path and approximate, with quantified error, otherwise. Both results rest on a framework that combines a mirror-flow interpretation with linear complementarity problems. The analysis covers both the uv and the u^2 parametrizations, using a systematic reduction from uv to u^2, and is complemented by simulations illustrating the trade-off between sparsity and data fit under early stopping. The results add a dynamical, path-following perspective to implicit regularization in DLNs and suggest a blueprint for extending such analyses to more complex network architectures via similar reductions.

Abstract

Diagonal linear networks are neural networks with linear activation and diagonal weight matrices. Their theoretical interest is that their implicit regularization can be rigorously analyzed: from a small initialization, the training of diagonal linear networks converges to the linear predictor with minimal 1-norm among minimizers of the training loss. In this paper, we deepen this analysis showing that the full training trajectory of diagonal linear networks is closely related to the lasso regularization path. In this connection, the training time plays the role of an inverse regularization parameter. Both rigorous results and simulations are provided to illustrate this conclusion. Under a monotonicity assumption on the lasso regularization path, the connection is exact while in the general case, we show an approximate connection.
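The implicit-regularization claim in the abstract can be checked on a toy problem. The sketch below is not the paper's simulation code. It assumes a squared loss $L(x) = \frac{1}{2}(a^\top x - y)^2$ with a single observation, a uv parametrization $x = u \odot v$ initialized at $u = \varepsilon$, $v = 0$, and plain gradient descent as a stand-in for gradient flow. The function names, step size, and number of steps are illustrative choices. With $a = (1, 2)$ and $y = 2$, the minimal 1-norm interpolator is $(0, 1)$, whereas gradient descent on $x$ directly from zero reaches the minimal 2-norm interpolator $(0.4, 0.8)$.

```python
def train_dln(a, y, eps=1e-3, lr=0.01, steps=5000):
    """Gradient descent on 0.5 * (sum_i a_i * u_i * v_i - y)**2 over (u, v)."""
    d = len(a)
    u = [eps] * d  # small initialization (assumed scheme, not taken from the paper)
    v = [0.0] * d  # chosen so that x = u * v starts at 0
    for _ in range(steps):
        r = sum(ai * ui * vi for ai, ui, vi in zip(a, u, v)) - y  # residual
        g = [ai * r for ai in a]  # gradient of the loss with respect to x
        # Chain rule: dL/du = g * v and dL/dv = g * u; update both simultaneously.
        u, v = (
            [ui - lr * gi * vi for ui, vi, gi in zip(u, v, g)],
            [vi - lr * gi * ui for ui, vi, gi in zip(u, v, g)],
        )
    return [ui * vi for ui, vi in zip(u, v)]


def train_linear(a, y, lr=0.01, steps=5000):
    """Baseline: gradient descent directly on x, started at 0."""
    x = [0.0] * len(a)
    for _ in range(steps):
        r = sum(ai * xi for ai, xi in zip(a, x)) - y
        x = [xi - lr * ai * r for ai, xi in zip(a, x)]
    return x


a, y = [1.0, 2.0], 2.0
x_dln = train_dln(a, y)
x_gd = train_linear(a, y)
print("DLN (uv) solution:", x_dln, "1-norm:", sum(abs(t) for t in x_dln))
print("plain GD solution:", x_gd, "1-norm:", sum(abs(t) for t in x_gd))
```

Under these assumptions the DLN iterate ends close to $(0, 1)$ and the baseline close to $(0.4, 0.8)$. Shrinking `eps` moves the DLN solution closer to the minimal 1-norm interpolator but lengthens the initial plateau, so `steps` may need to grow accordingly.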

Paper Structure

This paper contains 24 sections, 16 theorems, 99 equations, 1 figure.

Key Result

Theorem 2.1

For all $\mu > 0$, let $x(\mu)$ denote a minimizer of the lasso. Assume that $\mu > 0 \mapsto \mu x(\mu)$ is coordinate-wise monotone. Then

Figures (1)

  • Figure 1: Comparison between the average trajectory $\overline{x}^\varepsilon(s)$ of DLNs and the lasso regularization path $x(\mu)$. In subfigure (a), problem instances are generated randomly conditionally on the monotonicity of $\mu \mapsto \mu x(\mu)$. Conversely, in subfigure (b), problem instances are generated randomly conditionally on the non-monotonicity of $\mu \mapsto \mu x(\mu)$. We provide two instances of each case, one in each column. In each instance, we plot the coordinates $\overline{x}^\varepsilon_i(s)$ along with $x_i(\mu)$, and the suboptimality gap $\left(\mathop{\mathrm{Lasso}}\nolimits\left(\overline{x}^\varepsilon(s), s\right) - \mathop{\mathrm{Lasso}}\nolimits_*\left(s\right)\right)/\mathop{\mathrm{Lasso}}\nolimits_*\left(s\right)$. Simulation details are provided in Sec. \ref{sec:simulations}.

Theorems & Definitions (28)

  • Theorem 2.1
  • Theorem 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Proposition 4.1
  • Lemma 4.2
  • proof
  • Lemma 4.3
  • proof
  • Lemma 4.4
  • ...and 18 more