Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning
Zhiyuan Li, Yuping Luo, Kaifeng Lyu
TL;DR
The paper investigates the implicit regularization of gradient descent in matrix factorization and challenges norm-based characterizations of it. For depth-2 models, it shows that gradient flow with infinitesimal initialization is, under reasonable assumptions, equivalent to a greedy low-rank learning procedure (GLRL). The analysis is extended to deeper factorizations, where greater depth weakens the dependence on initialization magnitude, so rank minimization is more likely to take effect at practical initialization scales. A key contribution is a concrete counterexample to the nuclear-norm conjecture of Gunasekar et al. (2017), showing that gradient descent can favor low-rank solutions that nuclear norm minimization does not capture. The framework links end-to-end gradient dynamics to a phased procedure that increases rank step by step, supported by both theory and experiments, and gives a more expressive picture of implicit regularization than norm-based descriptions. The summary suggests GLRL as a lens for understanding optimization bias, with extensions to deep neural networks left as a direction to explore.
Abstract
Matrix factorization is a simple and natural test-bed to investigate the implicit regularization of gradient descent. Gunasekar et al. (2017) conjectured that gradient flow with infinitesimal initialization converges to the solution that minimizes the nuclear norm, but a series of recent papers argued that the language of norm minimization is not sufficient to give a full characterization of the implicit regularization. In this work, we provide theoretical and empirical evidence that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions. This generalizes the rank minimization view from previous works to a much broader setting and enables us to construct counter-examples to refute the conjecture from Gunasekar et al. (2017). We also extend the results to the case where depth $\ge 3$, and we show that the benefit of being deeper is that the above convergence has a much weaker dependence on initialization magnitude, so that this rank minimization is more likely to take effect for initialization with practical scale.
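To make the rank-by-rank behavior described above concrete, the following is a minimal sketch and not the paper's GLRL algorithm or experimental setup. It assumes a toy objective $f(U) = \frac{1}{2}\|UU^\top - M\|_F^2$ with a fully observed, hypothetical target $M = \mathrm{diag}(4, 1, 0)$, and it replaces gradient flow with plain gradient descent at a small step size. The function name `rank_history`, the initialization scale `alpha`, the step size `lr`, and the rank threshold `tol` are all illustrative choices.

```python
import numpy as np

def rank_history(target, alpha=1e-4, lr=0.01, steps=2000, tol=0.1, seed=0):
    """Run gradient descent on f(U) = 0.5 * ||U U^T - target||_F^2 from a
    small random initialization and record the numerical rank of U U^T."""
    rng = np.random.default_rng(seed)
    d = target.shape[0]
    U = alpha * rng.standard_normal((d, d))  # near-zero initialization (assumed scale)
    ranks = []
    for _ in range(steps):
        W = U @ U.T
        # Numerical rank: number of eigenvalues of the PSD matrix W above tol.
        ranks.append(int(np.sum(np.linalg.eigvalsh(W) > tol)))
        # For a symmetric target, the gradient of f with respect to U is 2 (W - target) U.
        U = U - lr * 2.0 * (W - target) @ U
    return ranks, U @ U.T

# Hypothetical PSD target with well-separated eigenvalues 4 and 1.
target = np.diag([4.0, 1.0, 0.0])
ranks, W_final = rank_history(target)

first_rank1 = ranks.index(1)  # first step at which W has numerical rank 1
first_rank2 = ranks.index(2)  # first step at which W has numerical rank 2
print("rank 1 first reached at step", first_rank1)
print("rank 2 first reached at step", first_rank2)
print("final numerical rank:", ranks[-1])
```

In this toy setting the iterate stays near rank 0, then near rank 1, then near rank 2, with a plateau between the phases. The plateau arises because each eigen-direction grows at a rate set by its eigenvalue, so with a very small `alpha` the larger eigenvalue is fitted long before the smaller one becomes visible. Increasing `alpha` shortens the plateau, which is in line with the abstract's point that the depth-2 result depends on the initialization magnitude.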
