On the Benefits of Weight Normalization for Overparameterized Matrix Sensing
Yudong Wei, Liang Zhang, Bingcong Li, Niao He
TL;DR
The paper addresses the recovery of a low-rank PSD matrix $\mathbf{A}$ from linear measurements by applying generalized weight normalization (WN) to a matrix factorization via polar decomposition. It develops a Riemannian gradient descent (RGD) scheme on the Stiefel manifold for the direction $\mathbf{X}$, paired with gradient-based updates for the magnitude $\mathbf{\Theta}$, which yields a two-phase convergence: a saddle-escape phase followed by linear convergence. The main contributions are (i) an exponential improvement in convergence rate over standard gradient methods, (ii) polynomial improvements in iteration and sample complexity as the level of overparameterization increases, and (iii) extensive numerical validation on synthetic and real data, including image reconstruction. The results provide theoretical and empirical evidence that overparameterization, when combined with weight normalization, can be leveraged to accelerate nonconvex matrix sensing and potentially other learning problems.
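To make the direction/magnitude split concrete, the following is a minimal NumPy sketch of one possible instantiation, not the authors' implementation. Everything beyond what the summary states is an assumption made for illustration: the estimate is parameterized as $\mathbf{X}\mathbf{\Theta}\mathbf{X}^\top$ with symmetric $\mathbf{\Theta}$, the measurement matrices are symmetric Gaussian, the loss is a mean squared residual, $\mathbf{X}$ and $\mathbf{\Theta}$ are updated simultaneously, the retraction is the polar retraction, and the step sizes and all function names (`make_problem`, `wn_step`, `run`) are hypothetical.

```python
import numpy as np


def sym(B):
    """Symmetric part of a square matrix."""
    return 0.5 * (B + B.T)


def make_problem(m=20, r_true=2, n=600, seed=0):
    """Synthetic instance (assumed setup): PSD target A and symmetric Gaussian M_i."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((m, r_true))
    A = Z @ Z.T / m                           # low-rank PSD ground truth
    G = rng.standard_normal((n, m, m))
    M = 0.5 * (G + G.transpose(0, 2, 1))      # symmetrized measurement matrices
    y = np.einsum("nij,ij->n", M, A)          # y_i = <M_i, A>
    return A, M, y


def loss_and_grads(X, Theta, M, y):
    """Loss 0.5 * mean_i (<M_i, X Theta X^T> - y_i)^2 and its Euclidean gradients."""
    P = X @ Theta @ X.T
    e = np.einsum("nij,ij->n", M, P) - y      # residuals
    R = np.einsum("n,nij->ij", e, M) / len(y) # symmetric because each M_i is
    loss = 0.5 * np.mean(e ** 2)
    grad_X = 2.0 * R @ X @ Theta              # valid for symmetric R and Theta
    grad_Theta = X.T @ R @ X
    return loss, grad_X, grad_Theta


def riemannian_grad(X, G):
    """Project a Euclidean gradient onto the tangent space of the Stiefel manifold at X."""
    return G - X @ sym(X.T @ G)


def polar_retraction(Y):
    """Map a full-column-rank matrix back to the Stiefel manifold via its polar factor."""
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ Vt


def wn_step(X, Theta, M, y, eta_X=0.05, eta_Theta=0.05):
    """One simultaneous update: Riemannian step on X, plain gradient step on Theta."""
    loss, gX, gT = loss_and_grads(X, Theta, M, y)
    X_new = polar_retraction(X - eta_X * riemannian_grad(X, gX))
    Theta_new = sym(Theta - eta_Theta * gT)   # keep Theta symmetric numerically
    return X_new, Theta_new, loss


def run(num_iters=200, m=20, r_true=2, r=5, n=600, seed=0):
    """Overparameterized run (r > r_true) from a random direction and small magnitude."""
    A, M, y = make_problem(m, r_true, n, seed)
    rng = np.random.default_rng(seed + 1)
    X, _ = np.linalg.qr(rng.standard_normal((m, r)))  # random point on the Stiefel manifold
    Theta = 1e-2 * np.eye(r)                          # small initial magnitude
    history = []
    for _ in range(num_iters):
        X, Theta, loss = wn_step(X, Theta, M, y)
        history.append(loss)
    return X, Theta, np.array(history)
```

Calling `run()` returns the final `X`, `Theta`, and the per-iteration loss values, which can be plotted to inspect the trajectory on a given instance. The sketch is meant only to show the mechanics of the split update (tangent-space projection, retraction, and a separate magnitude step); it makes no claim about reproducing the rates stated in the paper.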
Abstract
While normalization techniques are widely used in deep learning, their theoretical understanding remains relatively limited. In this work, we establish the benefits of (generalized) weight normalization (WN) applied to the overparameterized matrix sensing problem. We prove that WN with Riemannian optimization achieves linear convergence, yielding an exponential speedup over standard methods that do not use WN. Our analysis further demonstrates that both iteration and sample complexity improve polynomially as the level of overparameterization increases. To the best of our knowledge, this work provides the first characterization of how WN leverages overparameterization for faster convergence in matrix sensing.
