Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

Yuanzhi Li, Tengyu Ma, Hongyang Zhang

TL;DR

The paper analyzes how gradient descent implicitly regularizes over-parameterized models in matrix sensing and in neural networks with quadratic activations, assuming the measurements satisfy the restricted isometry property (RIP). When the target PSD matrix $X^{\star}$ is parameterized as $UU^\top$ and gradient descent starts from a small initialization, the iterates converge to the true low-rank solution at near-linear rates, yielding generalization bounds that depend on the initialization rather than on the parameter count. The paper extends these insights to one-hidden-layer quadratic networks, showing that similar generalization guarantees hold with favorable sample complexity once truncation is used to enforce RIP-like behavior. The analysis combines RIP-based concentration, low-rank dynamics, and adaptive subspace analysis to establish algorithmic regularization without heavy reliance on early stopping. Simulations reinforce the theory, highlighting the effect of initialization and the advantage of factorized gradient descent over naive PSD-projection approaches.

Abstract

We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations. Concretely, we show that given $\tilde{O}(dr^{2})$ random linear measurements of a rank $r$ positive semidefinite matrix $X^{\star}$, we can recover $X^{\star}$ by parameterizing it by $UU^\top$ with $U\in \mathbb R^{d\times d}$ and minimizing the squared loss, even if $r \ll d$. We prove that starting from a small initialization, gradient descent recovers $X^{\star}$ in $\tilde{O}(\sqrt{r})$ iterations approximately. The results solve the conjecture of Gunasekar et al.'17 under the restricted isometry property. The technique can be applied to analyzing neural networks with one-hidden-layer quadratic activations with some technical modifications.
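The setting described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: it assumes symmetrized Gaussian measurement matrices, the loss $\frac{1}{2m}\sum_i (\langle A_i, UU^\top\rangle - y_i)^2$, and an initialization $U_0 = \alpha\,\mathrm{Id}$. The function names (`make_sensing_problem`, `squared_loss`, `factorized_gd`) and the problem sizes are hypothetical choices made for this example.

```python
import numpy as np


def make_sensing_problem(d, r, m, seed=0):
    """Build a rank-r PSD target and m symmetric Gaussian measurements (assumed setup)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((d, r))
    X_star = B @ B.T
    X_star /= np.linalg.norm(X_star, 2)      # normalize to spectral norm 1
    G = rng.standard_normal((m, d, d))
    A = (G + G.transpose(0, 2, 1)) / 2.0     # symmetrize each measurement matrix
    y = np.einsum("kij,ij->k", A, X_star)    # y_i = <A_i, X_star>
    return A, y, X_star


def squared_loss(A, y, U):
    """Loss (1 / 2m) * sum_i (<A_i, U U^T> - y_i)^2."""
    resid = np.einsum("kij,ij->k", A, U @ U.T) - y
    return 0.5 * np.mean(resid ** 2)


def factorized_gd(A, y, alpha, eta, iters):
    """Gradient descent on the full d x d factor U, started from alpha * Id."""
    m, d, _ = A.shape
    U = alpha * np.eye(d)
    for _ in range(iters):
        resid = np.einsum("kij,ij->k", A, U @ U.T) - y
        M = np.einsum("k,kij->ij", resid, A) / m   # (1/m) * sum_i resid_i * A_i
        U = U - 2.0 * eta * (M @ U)                # gradient is 2 * M @ U for symmetric A_i
    return U


if __name__ == "__main__":
    d, r = 30, 1
    A, y, X_star = make_sensing_problem(d, r, m=10 * d * r)
    U = factorized_gd(A, y, alpha=1e-3, eta=0.05, iters=500)
    rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
    print("relative recovery error:", rel_err)
```

Note that $U$ is a full $d \times d$ matrix and the number of measurements is below $d(d+1)/2$, so the least-squares problem alone does not pin down $X^{\star}$; in this sketch the only bias toward a low-rank solution comes from the small value of `alpha`.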

Paper Structure

This paper contains 29 sections, 34 theorems, 208 equations, 5 figures, 1 algorithm.

Key Result

Theorem 1.1

Let $c$ be a sufficiently small absolute constant. Assume that the set of measurement matrices $(A_1,\dots, A_m)$ satisfies the $(4r,\delta)$-restricted isometry property (defined formally in the preliminaries section) with $\delta \le c/(\kappa^3\sqrt{r}\log^2 d)$. Suppose the initialization and learning rate

Figures (5)

  • Figure 1: Generalization performance depends on the choice of initialization: the gap between training and test error decreases as $\alpha$ decreases. Here the number of samples is $5dr$, where rank $r = 5$. We initialize with $\alpha\,\mathrm{Id}$, and run $10^4$ iterations with step size $0.0025$.
  • Figure 2: Further comparison between the generalization performance of large versus small initializations. We plot the data points from iteration 500 onwards to simplify the scale of the y-axis. The step size is $0.0025$.
  • Figure 3: Test error keeps decreasing as the number of iterations goes to $10^5$. Here the number of samples is $m = 5dr$, where rank $r = 5$. Note that the initial test error is approximately $1$.
  • Figure 4: Projected gradient descent (PGD) requires more samples to recover $X^\star$ accurately than gradient descent on the factorized model. Moreover, the performance of PGD gets worse as $d$ increases.
  • Figure 5: Stochastic gradient descent, when initialized with the identity matrix, does not generalize to test data. Here $d = 100$ and $r = 5$.
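
For context on the PGD baseline in Figure 4, the following is a rough sketch of one projected gradient step, assuming the projection is onto the PSD cone (computed by clipping negative eigenvalues) and that the measurement matrices are symmetric. The exact variant used in the paper's experiments is not specified in this summary, and the names `project_psd` and `pgd_step` are hypothetical.

```python
import numpy as np


def project_psd(X):
    """Euclidean projection of a square matrix onto the PSD cone."""
    S = (X + X.T) / 2.0                       # symmetrize first
    w, V = np.linalg.eigh(S)                  # eigendecomposition of a symmetric matrix
    return (V * np.clip(w, 0.0, None)) @ V.T  # zero out negative eigenvalues


def pgd_step(X, A, y, eta):
    """One gradient step on (1 / 2m) * sum_i (<A_i, X> - y_i)^2, then PSD projection."""
    resid = np.einsum("kij,ij->k", A, X) - y
    grad = np.einsum("k,kij->ij", resid, A) / A.shape[0]
    return project_psd(X - eta * grad)
```

Unlike gradient descent on the factor $U$, this update works directly on the $d \times d$ matrix $X$, so the only structure it enforces is positive semidefiniteness.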

Theorems & Definitions (68)

  • Theorem 1.1
  • Theorem 1.2
  • Definition 2.1
  • Lemma 2.2
  • Lemma 2.3
  • Theorem 3.1
  • Proposition 3.2: Error dynamics
  • proof: Proof Sketch of Proposition 3.2
  • Proposition 3.3: Signal dynamics
  • proof
  • ...and 58 more