Solving Inverse Problems with Deep Linear Neural Networks: Global Convergence Guarantees for Gradient Descent with Weight Decay

Hannah Laus, Suzanna Parkinson, Vasileios Charisopoulos, Felix Krahmer, Rebecca Willett

TL;DR

The paper analyzes underdetermined linear inverse problems solved by deep linear networks trained with gradient descent and weight decay. It proves that such training automatically learns an approximate inverse that respects latent subspace structure, with robust performance on the subspace and controlled off-subspace behavior, under standard RIP-like assumptions and initialization. A three-phase convergence argument shows fast initial reconstruction followed by stabilization and eventual off-subspace convergence, with explicit bounds linking reconstruction and robustness to noise to the regularization strength and network width. The results highlight the regularization/overparameterization tradeoff: weight decay improves robustness and generalization, while deeper networks accelerate convergence, offering a principled explanation for observed empirical benefits and guiding future nonlinear extensions.

Abstract

Machine learning methods are commonly used to solve inverse problems, wherein an unknown signal must be estimated from few indirect measurements generated via a known acquisition procedure. In particular, neural networks perform well empirically but have limited theoretical guarantees. In this work, we study an underdetermined linear inverse problem that admits several possible solution operators that map measurements to estimates of the target signal. A standard remedy (e.g., in compressed sensing) for establishing the uniqueness of the solution mapping is to assume the existence of a latent low-dimensional structure in the source signal. We ask the following question: do deep linear neural networks adapt to unknown low-dimensional structure when trained by gradient descent with weight decay regularization? We prove that mildly overparameterized deep linear networks trained in this manner converge to an approximate solution mapping that accurately solves the inverse problem while implicitly encoding latent subspace structure. We show rigorously that deep linear networks trained with weight decay automatically adapt to latent subspace structure in the data under practical stepsize and weight initialization schemes. Our work highlights that regularization and overparameterization improve generalization, while overparameterization also accelerates convergence during training.
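
The training setup described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the loss normalization, the Gaussian initialization scale, the problem sizes, and the helper names (`chain`, `loss`, `grads`, `train`) are all assumptions made for this example. It trains a depth-3 linear network $W_3 W_2 W_1$ to map measurements $Y = AX$ back to signals $X$ lying on a low-dimensional subspace, using plain gradient descent on a squared loss plus an $\ell_2$ (weight decay) penalty.

```python
import numpy as np

def chain(mats, dim):
    """Return W_k @ ... @ W_1 for mats = [W_1, ..., W_k] (identity of size dim if empty)."""
    out = np.eye(dim)
    for W in mats:
        out = W @ out
    return out

def loss(Ws, X, Y, lam):
    """Squared reconstruction loss plus weight decay on every layer (assumed normalization)."""
    R = chain(Ws, Y.shape[0]) @ Y - X
    return 0.5 * np.sum(R**2) + 0.5 * lam * sum(np.sum(W**2) for W in Ws)

def grads(Ws, X, Y, lam):
    """Gradient of `loss` with respect to each layer, via the chain rule."""
    R = chain(Ws, Y.shape[0]) @ Y - X
    out = []
    for i, W in enumerate(Ws):
        below = chain(Ws[:i], Y.shape[0]) @ Y   # input seen by layer i
        above = chain(Ws[i + 1:], W.shape[0])   # product of the layers after layer i
        out.append(above.T @ R @ below.T + lam * W)
    return out

def train(Ws, X, Y, lam, eta, steps):
    """Plain gradient descent with constant stepsize eta."""
    for _ in range(steps):
        Ws = [W - eta * G for W, G in zip(Ws, grads(Ws, X, Y, lam))]
    return Ws

# Synthetic problem (all sizes are hypothetical, chosen to run quickly).
rng = np.random.default_rng(0)
d, m, s, n, width, depth = 12, 6, 2, 30, 16, 3
U, _ = np.linalg.qr(rng.standard_normal((d, s)))   # orthonormal basis of the latent subspace
X = U @ rng.standard_normal((s, n))                # signals on the subspace
A = rng.standard_normal((m, d)) / np.sqrt(m)       # underdetermined measurement operator (m < d)
Y = A @ X                                          # measurements

dims = [m] + [width] * (depth - 1) + [d]
Ws0 = [0.1 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(depth)]
Ws = train(Ws0, X, Y, lam=1e-3, eta=1e-4, steps=200)
print("loss before:", loss(Ws0, X, Y, 1e-3), "loss after:", loss(Ws, X, Y, 1e-3))
```

The stepsize here is deliberately conservative so that the loss decreases monotonically in this toy run; it is not the stepsize schedule analyzed in the paper.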

Paper Structure

This paper contains 59 sections, 45 theorems, 255 equations, 5 figures.

Key Result

Lemma 1

Given a linear operator $W \in \mathbb{R}^{d \times m}$, the operator norm distance between $W$ and the oracle mapping $W_\mathrm{oracle} \in \mathbb{R}^{d \times m}$ defined in eq:oracle will satisfy where $P_{\mathrm{range}(Y)}$ denotes the orthogonal projection onto the range of $Y$.

Figures (5)

  • Figure 1: Effect of weight decay on robustness against Gaussian noise, $\epsilon \sim \mathcal{N}(0,\sigma^2)$, at test time. Experiments use signals of dimension $d = 256$ lying on a single subspace of dimension $s = 16$ (\ref{fig:wd-robustness-linear}) or a union of $k=3$ subspaces of dimension $s=4$ each (\ref{fig:wd-robustness-nonlinear}) and $m = 128$ measurements; all networks have $L = 5$ layers and hidden layer width $d_{w} = 4096$. In both cases, adding sufficient $\ell_2$-regularization facilitates adaptation to the low-dimensional structure and thus significantly improves robustness, but too much regularization results in a poor fit to the data and hence poor robustness. For a detailed description of the model in \ref{fig:wd-robustness-nonlinear}, see \ref{sec:appendix numerical description}.
  • Figure 2: Reconstruction and off-subspace errors for gradient descent across different step sizes $\eta := k \cdot \frac{m}{L \cdot \sigma_{\max}^2(X)}$. We observe iterates diverging when the multiplicative pre-factor $k$ is larger than $5$. Note that \ref{fig:stepsize-sweep-a} spans the first $10{,}000$ iterations. The vertical dashed line in \ref{fig:stepsize-sweep-a} indicates the value of $\tau_{\mathsf{ub}}$ as prescribed by \ref{eq:tau-ub} for the case of $k=1$; our theory correctly predicts that the reconstruction error will rebound by iteration $\tau_{\mathsf{ub}}$. In \ref{fig:stepsize-sweep-b}, we see that the off-subspace error slowly decays as the learned network adapts to the latent low-dimensional structure in $X$.
  • Figure 3: Comparing the training error of a deep linear neural network for data of varying subspace dimensions $s$ using constant stepsize $\eta = 1/10$ and weight decay $\lambda = 10^{-3}$. The lines are the median over $10$ runs with independently sampled training data and weight initializations. The shaded region indicates one standard deviation around the median. See \ref{sec:impact of s} for details.
  • Figure 4: Normalized reconstruction error and off-subspace error for deep linear nets of varying depths $L$, trained with gradient descent using constant stepsize $\eta = 1/10$ and weight decay parameter $\lambda = 10^{-4}$. While the reconstruction error drops to similar levels for all depths, larger $L$ confers a clear advantage with respect to the off-subspace error. See \ref{sec:subsec:depth} for details.
  • Figure 5: Normalized reconstruction error and off-subspace error for deep linear nets trained with gradient descent with stepsize $\eta = 1/10$ and varying levels of weight decay $\lambda$. While high levels of weight decay reduce the off-subspace error faster, they lead to larger reconstruction error. See \ref{sec:subsec:wd} for details.
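
The quantities tracked in the figures above can be illustrated with a short NumPy sketch. The exact definitions are not given in this summary, so the ones below are assumptions: the normalized reconstruction error is taken to be $\|WY - X\|_F / \|X\|_F$, the off-subspace error is taken to be the operator norm of $W (I - P_{\mathrm{range}(Y)})$, and noise robustness is probed by adding Gaussian noise to the measurements at test time. The function names and problem sizes are hypothetical.

```python
import numpy as np

def range_projector(Y, tol=1e-10):
    """Orthogonal projector onto range(Y), built from the leading left singular vectors."""
    U, sv, _ = np.linalg.svd(Y, full_matrices=False)
    r = int(np.sum(sv > tol * sv[0]))
    return U[:, :r] @ U[:, :r].T

def reconstruction_error(W, X, Y):
    """Assumed definition: ||W Y - X||_F / ||X||_F."""
    return np.linalg.norm(W @ Y - X) / np.linalg.norm(X)

def off_subspace_error(W, Y):
    """Assumed definition: operator norm of W restricted to the complement of range(Y)."""
    P = range_projector(Y)
    return np.linalg.norm(W @ (np.eye(Y.shape[0]) - P), ord=2)

def noisy_test_error(W, X, Y, sigma, rng):
    """Relative error when i.i.d. N(0, sigma^2) noise is added to the measurements."""
    noise = sigma * rng.standard_normal(Y.shape)
    return np.linalg.norm(W @ (Y + noise) - X) / np.linalg.norm(X)

# Toy data: signals on an s-dimensional subspace, m < d measurements (sizes are hypothetical).
rng = np.random.default_rng(1)
d, m, s, n = 12, 6, 2, 30
U, _ = np.linalg.qr(rng.standard_normal((d, s)))
X = U @ rng.standard_normal((s, n))
A = rng.standard_normal((m, d)) / np.sqrt(m)
Y = A @ X

# Two linear maps that both fit the training data exactly:
# W_min vanishes off range(Y); W_bad adds an arbitrary component there.
W_min = X @ np.linalg.pinv(Y, rcond=1e-10)
P = range_projector(Y)
W_bad = W_min + rng.standard_normal((d, m)) @ (np.eye(m) - P)

for name, W in [("W_min", W_min), ("W_bad", W_bad)]:
    print(name,
          "reconstruction:", reconstruction_error(W, X, Y),
          "off-subspace:", off_subspace_error(W, Y),
          "noisy test:", noisy_test_error(W, X, Y, 0.1, rng))
```

Both maps reconstruct the noiseless training data, but only the one with no off-subspace component stays accurate under measurement noise. This is one way to see why the off-subspace error matters for robustness.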

Theorems & Definitions (52)

  • Lemma 1
  • Theorem 2.1
  • Remark 1
  • Remark 2
  • Corollary 1
  • Remark 3
  • Corollary 2
  • Remark 4
  • Theorem 4.1: Generalized main theorem
  • Remark 5
  • ...and 42 more