A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection Features

Emi Zeger, Yifei Wang, Aaron Mishkin, Tolga Ergen, Emmanuel Candès, Mert Pilanci

TL;DR

The paper shows that training deep neural networks on 1-D data can be recast as a convex Lasso problem with an explicit dictionary, enabling global-optimality analysis and tractable solution paths. The dictionary grows with depth and encodes piecewise-linear features, with reflection features appearing for ReLU and absolute value activations once depth is at least three; sign and threshold activations, in contrast, yield dictionaries lacking reflections. It provides concrete dictionaries for 2-layer and deeper architectures, reconstruction maps from Lasso solutions to optimal networks, and polynomial or combinatorial bounds on dictionary sizes. Empirically, the convex reformulation via Lasso (cvxNN) shows competitive training loss and generalization in autoregressive time-series tasks, corroborating the theoretical predictions and offering a scalable training paradigm for low-dimensional data.

Abstract

We prove that training neural networks on 1-D data is equivalent to solving convex Lasso problems with discrete, explicitly defined dictionary matrices. We consider neural networks with piecewise linear activations and depths ranging from 2 to an arbitrary but finite number of layers. We first show that two-layer networks with piecewise linear activations are equivalent to Lasso models using a discrete dictionary of ramp functions, with breakpoints corresponding to the training data points. In certain general architectures with absolute value or ReLU activations, a third layer surprisingly creates features that reflect the training data about themselves. Additional layers progressively generate reflections of these reflections. The Lasso representation provides valuable insights into the analysis of globally optimal networks, elucidating their solution landscapes and enabling closed-form solutions in certain special cases. Numerical results show that reflections also occur when optimizing standard deep networks using standard non-convex optimizers. Additionally, we demonstrate our theory with autoregressive time series models.
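The two-layer statement above can be illustrated with a small numerical sketch. It assumes the ramp dictionary consists of the columns $(x - x_j)_+$ and $(x_j - x)_+$ for each training point $x_j$, and it solves the Lasso problem with plain proximal gradient descent (ISTA). The function names `ramp_dictionary` and `lasso_ista`, the toy data, and the value of `lam` are illustrative choices, not taken from the paper, and the paper's exact dictionary and scaling may differ.

```python
import numpy as np

def ramp_dictionary(x):
    """Assumed dictionary: columns relu(x - x_j) and relu(x_j - x), one pair per training point."""
    diff = x[:, None] - x[None, :]  # diff[i, j] = x_i - x_j
    return np.hstack([np.maximum(diff, 0.0), np.maximum(-diff, 0.0)])

def lasso_objective(A, y, z, lam):
    """0.5 * ||A z - y||^2 + lam * ||z||_1."""
    return 0.5 * np.sum((A @ z - y) ** 2) + lam * np.sum(np.abs(z))

def lasso_ista(A, y, lam, n_iter=5000):
    """Minimize the Lasso objective by proximal gradient descent (ISTA)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        w = z - step * (A.T @ (A @ z - y))                          # gradient step
        z = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)    # soft-thresholding
    return z

# Toy 1-D data (hypothetical): five training points with targets |x|.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.abs(x)
lam = 1e-3

A = ramp_dictionary(x)
z = lasso_ista(A, y, lam)
print("dictionary shape:", A.shape)
print("objective at zero:", lasso_objective(A, y, np.zeros(A.shape[1]), lam))
print("objective at ISTA solution:", lasso_objective(A, y, z, lam))
```

Each nonzero entry of `z` selects one ramp, so the fitted function is piecewise linear with breakpoints only at training points, which is the structure the two-layer result describes.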

Paper Structure (35 sections, 67 theorems, 78 equations, 24 figures)

Key Result

Lemma 3.1

The training problem for a $2$-layer network with a skip connection and ReLU activation is equivalent to the same problem with the activation changed to absolute value, and there is a map between the solutions under the two activations.
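One way to see why the two activations can be exchanged is the elementary identity relating ReLU and absolute value. The display below is a sketch of that intuition, not the paper's proof. The symbols $\alpha_i$, $w_i$, $b_i$ (neuron weights and biases), $\omega$ (skip-connection weight), and $\beta$ (output bias) are notation assumed here for illustration.

```latex
\begin{align*}
(z)_+ &= \tfrac{1}{2}\bigl(|z| + z\bigr), \\
\sum_{i} \alpha_i (w_i x + b_i)_+ + \omega x + \beta
  &= \sum_{i} \tfrac{\alpha_i}{2}\,|w_i x + b_i|
   + \Bigl(\omega + \tfrac{1}{2}\sum_{i} \alpha_i w_i\Bigr) x
   + \Bigl(\beta + \tfrac{1}{2}\sum_{i} \alpha_i b_i\Bigr).
\end{align*}
```

The leftover linear term is absorbed by the skip connection and the constant by the output bias, assuming one is present. How the regularization terms correspond under this change of variables is what the lemma itself establishes.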

Figures (24)

  • Figure 1: Example features, not including reversed directions, for deep narrow networks with absolute value activation. Top row: $3$-layer features. The top left feature contains a breakpoint at the reflection of $x_{j_2}$ (red) across $x_{j_1}$ (yellow), which is denoted as $R_{\left({x_{j_2}},{x_{j_1}}\right)}$ (red encircling yellow). Other breakpoints are colored similarly. Bottom row: an example of a $4$-layer feature, which contains a double reflection of $x_{j_3}$ (blue) reflected across $x_{j_2}$ (red), then reflected across $x_{j_1}$ (yellow), which is denoted as $R_{({R_{({x_{j_3}},{x_{j_2}})}},{x_{j_1}})}$ (blue encircling red, encircling yellow). All lines have slopes $\pm1$, and $x_{j_1},x_{j_2}, x_{j_3}$ are training data.
  • Figure 1: Comparison of neural autoregressive models of the form $x_{t}=f(x_{t-1};\theta)+\epsilon_t$ using convex and non-convex optimizers and the classical linear model AR(1) for time series forecasting. The horizontal axis is the training epoch. The dataset is BTC-2017min from Kaggle, which contains all $1$-minute Bitcoin prices in 2017. The non-linear models outperform the linear AR(1) model. Moreover, SGD underperforms in training and test loss compared to the convex model, which is guaranteed to find a global optimum of the NN objective.
  • Figure 1: Figure for \ref{app:reconstrct}. Deep library features for a $3$-layer symmetrized ReLU network. Each row corresponds to a different set of weights $\mathbf{W}^{\left({1}\right)} {\in} \{{-}1,1\}^{1\times 2},\mathbf{W}^{\left({2}\right)} {\in} \{{-}1,1\}^{2 \times 1}$. Each column corresponds to a different ordering of $x_i, x_j, x_k$. The generalized reflections of $x_{j_1}$ (yellow) across $x_{j_2}$ (red) and $x_{j_3}$ (blue) are depicted by yellow encircling purple.
  • Figure 1: Training objective
  • Figure 2: Lasso and Adam-trained deep narrow networks with absolute value activation. For $L{=}3$, the breakpoint at $2$ is not a training point; it is the reflection of $x_2{=}0$ across $x_1{=}1$. For $L{=}4$, the breakpoint at $6$ is not a training point; it is $x_2{=}0$ reflected about $x_3{=}{-}1$ to ${-}2$ (which is not a training point) and then reflected across $x_1{=}2$. Similarly, the $5$-layer network contains more complex reflections.
  • ...and 19 more figures
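
The reflection arithmetic quoted in the Figure 2 caption can be checked with a few lines of Python. This is a minimal sketch: the helper names `reflect` and `three_layer_feature` are hypothetical, and the feature form $\big||x - x_{j_1}| - |x_{j_2} - x_{j_1}|\big|$ is an assumed example of a $3$-layer absolute value feature with slopes $\pm 1$, chosen to be consistent with the captions rather than copied from the paper.

```python
def reflect(a, b):
    """Reflection of the point a across the point b on the real line: 2*b - a."""
    return 2 * b - a

def three_layer_feature(x, x_j1, x_j2):
    """Assumed 3-layer feature | |x - x_j1| - |x_j2 - x_j1| | with slopes +1 or -1."""
    return abs(abs(x - x_j1) - abs(x_j2 - x_j1))

# L = 3 case from the caption: x_2 = 0 reflected across x_1 = 1.
r3 = reflect(0, 1)

# L = 4 case from the caption: x_2 = 0 reflected about x_3 = -1, then across x_1 = 2.
inner = reflect(0, -1)
r4 = reflect(inner, 2)

print(r3, inner, r4)  # 2 -2 6

# The assumed feature vanishes at x_2 and at its reflection, so both are breakpoints.
print(three_layer_feature(0, 1, 0), three_layer_feature(r3, 1, 0))  # 0 0
```

The feature is zero at the training point $x_{j_2}$ and at its mirror image $2x_{j_1} - x_{j_2}$, which is how a breakpoint can appear at a location that is not a training point.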

Theorems & Definitions (166)

  • Lemma 3.1
  • Theorem 3.2: Lasso equivalent of deep absolute value networks
  • Definition 3.3
  • Theorem 3.4: deep narrow ReLU network representation capability stagnates
  • Theorem 3.5: wider ReLU networks do not stagnate and generate reflections
  • Remark 3.6
  • Theorem 3.7
  • Lemma 3.8
  • Definition 3.9
  • Definition 3.10
  • ...and 156 more