Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
Yuanzhi Li, Tengyu Ma, Hongyang Zhang
TL;DR
The paper analyzes how gradient descent implicitly regularizes over-parameterized models in matrix sensing and in neural networks with quadratic activations, assuming the measurements satisfy the restricted isometry property (RIP). When the estimate of the target PSD matrix X* is parameterized as UU^T and gradient descent starts from a small initialization, the iterates approximately recover the true low-rank solution in Õ(√r) iterations, yielding generalization bounds that depend on the initialization scale rather than the parameter count. The paper extends these insights to one-hidden-layer quadratic networks, showing that with truncation to enforce RIP-like behavior, similar generalization guarantees hold with favorable sample complexity. The analysis combines RIP-based concentration, low-rank dynamics, and adaptive subspace analysis to establish algorithmic regularization without heavy reliance on early stopping. Simulations reinforce the theory, highlighting the effect of initialization and the advantage of factorized gradient descent over naive PSD-projection approaches.
Abstract
We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations. Concretely, we show that given $\tilde{O}(dr^{2})$ random linear measurements of a rank-$r$ positive semidefinite matrix $X^{\star}$, we can recover $X^{\star}$ by parameterizing it as $UU^\top$ with $U\in \mathbb R^{d\times d}$ and minimizing the squared loss, even if $r \ll d$. We prove that, starting from a small initialization, gradient descent approximately recovers $X^{\star}$ in $\tilde{O}(\sqrt{r})$ iterations. The results resolve the conjecture of Gunasekar et al.'17 under the restricted isometry property. The technique can be applied, with some technical modifications, to analyzing one-hidden-layer neural networks with quadratic activations.
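The setup described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `factorized_gd_matrix_sensing`, the symmetric Gaussian measurement matrices, the scaling of the squared loss, and all default values (dimension, number of measurements, initialization scale `alpha`, step size `eta`, iteration count) are assumptions chosen for illustration only. The sketch runs plain gradient descent on a full $d \times d$ factor $U$ from the small initialization $U_0 = \alpha I$ and tracks how far $UU^\top$ is from $X^{\star}$.

```python
import numpy as np

def factorized_gd_matrix_sensing(d=40, r=2, m=800, alpha=1e-3,
                                 eta=0.1, steps=300, seed=0):
    """Gradient descent on U (d x d) for the loss
    f(U) = 1/(4m) * sum_i (<A_i, U U^T> - y_i)^2  (scaling is an assumption)."""
    rng = np.random.default_rng(seed)

    # Hypothetical ground truth: rank-r PSD matrix with unit eigenvalues.
    B, _ = np.linalg.qr(rng.standard_normal((d, r)))
    X_star = B @ B.T

    # Assumed measurement model: symmetric Gaussian matrices A_i,
    # flattened so that <A_i, X> = A_flat[i] @ X.ravel().
    G = rng.standard_normal((m, d, d))
    A_flat = ((G + G.transpose(0, 2, 1)) / 2).reshape(m, d * d)
    y = A_flat @ X_star.ravel()

    # Small initialization: U is full d x d, so the model is over-parameterized.
    U = alpha * np.eye(d)
    norm_star = np.linalg.norm(X_star, "fro")
    rel_errs = []
    for _ in range(steps):
        resid = A_flat @ (U @ U.T).ravel() - y        # <A_i, U U^T> - y_i
        M = (resid @ A_flat).reshape(d, d) / m        # (1/m) sum_i resid_i A_i
        U = U - eta * (M @ U)                         # gradient of f is M U
        rel_errs.append(np.linalg.norm(U @ U.T - X_star, "fro") / norm_star)
    return X_star, U, rel_errs

if __name__ == "__main__":
    X_star, U, rel_errs = factorized_gd_matrix_sensing()
    print("relative error after first step:", rel_errs[0])
    print("relative error after last step: ", rel_errs[-1])
```

With these defaults the number of measurements (800) is smaller than the 1600 entries of $U$, so many factors fit the data exactly; which one gradient descent reaches depends on the initialization. Re-running with a larger `alpha` is one way to probe the dependence on initialization scale that the abstract describes.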
