Global Convergence of Four-Layer Matrix Factorization under Random Initialization
Minrui Luo, Weihang Xu, Xiang Gao, Maryam Fazel, Simon Shaolei Du
TL;DR
This work proves a polynomial-time global convergence guarantee for gradient descent on a four-layer matrix factorization problem under random Gaussian initialization, extending global results beyond the NTK regime to a deeper linear network. The authors develop a three-stage training analysis (alignment, saddle avoidance, local convergence) and introduce novel techniques to bound eigenvalue changes and avoid saddle points, including a non-increasing skew-Hermitian error term and a non-decreasing Hermitian main term. A balanced regularization term and random-matrix tools (Circular Ensemble concepts) enable precise initialization bounds and stage timings, yielding convergence with high probability for complex initializations and with probability close to 1/2 for real initializations. The results advance theoretical understanding of deep linear network training dynamics and offer a path toward global guarantees for general depth and target structures, with the caveat that the target is assumed to have identical singular values in the formal statements. The insights into eigenvalue dynamics and saddle-avoidance mechanisms could inform broader analyses of non-convex training dynamics in deep learning.
Abstract
Gradient descent dynamics on the deep matrix factorization problem are extensively studied as a simplified theoretical model for deep neural networks. Although the convergence theory for two-layer matrix factorization is well-established, no global convergence guarantee for general deep matrix factorization under random initialization has been established to date. To address this gap, we provide a polynomial-time global convergence guarantee for randomly initialized gradient descent on four-layer matrix factorization, given certain conditions on the target matrix and a standard balanced regularization term. Our analysis employs new techniques to show saddle-avoidance properties of gradient descent dynamics, and extends previous theories to characterize the change in eigenvalues of layer weights.
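
To make the setting concrete, the following is a minimal NumPy sketch of gradient descent on a four-layer factorization with a balancing regularizer. It is an illustration under stated assumptions, not the authors' implementation: the objective 0.5 * ||W4 W3 W2 W1 - Sigma||_F^2 + (a/4) * sum_j ||W_{j+1}^T W_{j+1} - W_j W_j^T||_F^2 is one commonly used form of balanced regularization and may differ from the paper's exact definition. The function names, the identity target (chosen so that all singular values are identical), the real-valued initialization, and the hyperparameters `a`, `lr`, `init_scale`, and `steps` are all hypothetical choices made for this example.

```python
import numpy as np

def product(Ws):
    """Return W4 @ W3 @ W2 @ W1 for Ws = [W1, W2, W3, W4]."""
    P = Ws[0]
    for W in Ws[1:]:
        P = W @ P
    return P

def objective(Ws, target, a):
    """Fit term plus an assumed balancing regularizer with weight a."""
    fit = 0.5 * np.linalg.norm(product(Ws) - target) ** 2
    reg = sum(
        np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T) ** 2
        for j in range(3)
    )
    return fit + 0.25 * a * reg

def gradients(Ws, target, a):
    """Closed-form gradients of `objective` with respect to each layer."""
    W1, W2, W3, W4 = Ws
    E = product(Ws) - target  # residual of the end-to-end product
    grads = [
        (W4 @ W3 @ W2).T @ E,
        (W4 @ W3).T @ E @ W1.T,
        W4.T @ E @ (W2 @ W1).T,
        E @ (W3 @ W2 @ W1).T,
    ]
    for j in range(3):
        # D measures the imbalance between adjacent layers j and j+1.
        D = Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T
        grads[j + 1] = grads[j + 1] + a * Ws[j + 1] @ D
        grads[j] = grads[j] - a * D @ Ws[j]
    return grads

def run_gd(d=4, a=1.0, lr=0.01, init_scale=0.1, steps=2000, seed=0):
    """Run plain gradient descent from a small real Gaussian initialization."""
    rng = np.random.default_rng(seed)
    target = np.eye(d)  # assumed target with identical singular values
    Ws = [init_scale * rng.standard_normal((d, d)) for _ in range(4)]
    history = [objective(Ws, target, a)]
    for _ in range(steps):
        grads = gradients(Ws, target, a)
        Ws = [W - lr * g for W, g in zip(Ws, grads)]
        history.append(objective(Ws, target, a))
    return Ws, history

if __name__ == "__main__":
    _, history = run_gd()
    print("initial objective:", history[0])
    print("final objective:", history[-1])
```

The sketch uses real-valued weights, for which the summary above reports a success probability close to 1/2, so a given seed may or may not reach a small final objective. It is meant only to show the structure of the update, not to reproduce the convergence guarantee.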
