Global Convergence of Four-Layer Matrix Factorization under Random Initialization

Minrui Luo, Weihang Xu, Xiang Gao, Maryam Fazel, Simon Shaolei Du

TL;DR

This work proves a polynomial-time global convergence guarantee for gradient descent on a four-layer matrix factorization problem under random Gaussian initialization, extending global results beyond the NTK regime to a deeper linear network. The authors develop a three-stage training analysis (alignment, saddle avoidance, local convergence) and introduce new techniques to bound eigenvalue changes and avoid saddle points, including a non-increasing skew-Hermitian error and a non-decreasing Hermitian main term. A balanced regularization term and random-matrix tools (Circular Ensemble concepts) enable precise initialization bounds and stage timings, yielding convergence with high probability for complex initializations and with probability close to 1/2 for real initializations. The results advance the theoretical understanding of deep linear network training dynamics and offer a path toward global guarantees for general depth and target structures, with the caveat that the formal statements assume the target has identical singular values. The insights into eigenvalue dynamics and saddle-avoidance mechanisms could inform broader analyses of non-convex training dynamics in deep learning.

Abstract

Gradient descent dynamics on the deep matrix factorization problem are extensively studied as a simplified theoretical model for deep neural networks. Although the convergence theory for two-layer matrix factorization is well-established, no global convergence guarantee for general deep matrix factorization under random initialization has been established to date. To address this gap, we provide a polynomial-time global convergence guarantee for randomly initialized gradient descent on four-layer matrix factorization, given certain conditions on the target matrix and a standard balanced regularization term. Our analysis employs new techniques to show saddle-avoidance properties of gradient descent dynamics, and extends previous theories to characterize the change in eigenvalues of layer weights.
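The setting described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code, and several choices are assumptions made for illustration: the objective is taken to be $\frac{1}{2}\|W_4 W_3 W_2 W_1 - \Sigma\|_F^2 + \frac{a}{4}\sum_{j=1}^{3}\|W_j W_j^\top - W_{j+1}^\top W_{j+1}\|_F^2$ (one common form of balanced regularization; the paper's exact form is not given here), the matrices are real and square, and the names `loss_and_grads`, `gradient_descent`, `eps`, `a`, and `lr` are hypothetical.

```python
import numpy as np


def loss_and_grads(Ws, target, a):
    """Loss and gradients for an assumed four-layer objective.

    Assumed objective (real matrices):
        0.5 * ||W4 W3 W2 W1 - target||_F^2
        + (a / 4) * sum_j ||W_j W_j^T - W_{j+1}^T W_{j+1}||_F^2
    """
    W1, W2, W3, W4 = Ws
    E = W4 @ W3 @ W2 @ W1 - target  # residual of the end-to-end product
    loss = 0.5 * np.sum(E ** 2)
    # Gradients of the fitting term with respect to each layer.
    grads = [
        (W4 @ W3 @ W2).T @ E,
        (W4 @ W3).T @ E @ W1.T,
        W4.T @ E @ (W2 @ W1).T,
        E @ (W3 @ W2 @ W1).T,
    ]
    # Balance term between each pair of adjacent layers.
    for j in range(3):
        D = Ws[j] @ Ws[j].T - Ws[j + 1].T @ Ws[j + 1]  # symmetric imbalance
        loss += 0.25 * a * np.sum(D ** 2)
        grads[j] = grads[j] + a * D @ Ws[j]
        grads[j + 1] = grads[j + 1] - a * Ws[j + 1] @ D
    return loss, grads


def gradient_descent(target, eps, a, lr, steps, seed=0):
    """Plain gradient descent from a Gaussian initialization scaled by eps."""
    rng = np.random.default_rng(seed)
    d = target.shape[0]
    Ws = [eps * rng.standard_normal((d, d)) for _ in range(4)]
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(Ws, target, a)
        losses.append(loss)
        Ws = [W - lr * g for W, g in zip(Ws, grads)]
    return Ws, losses


if __name__ == "__main__":
    # Identity target, so all singular values of the target are equal.
    Ws, losses = gradient_descent(np.eye(3), eps=0.1, a=1.0, lr=1e-2, steps=2000)
    print("first loss:", losses[0], "last loss:", losses[-1])
```

The hyperparameters in the usage example are arbitrary and do not follow the scaling conditions of the theorem. With a real initialization the paper reports convergence only with probability close to 1/2, so a single seed may or may not reach the target.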

Paper Structure

This paper contains 53 sections, 58 theorems, 325 equations, 4 figures.

Key Result

Theorem 1

Consider four-layer matrix factorization under gradient descent, random Gaussian initialization with scaling factor $\epsilon \le \sigma_1^{1/4}(\Sigma) / {\rm poly}( 1 / \delta, d)$, regularization factor $a \ge \sigma_1(\Sigma) \cdot {\rm poly}\left( 1 / \delta, d, \ln\left(\sigma_1^{1/4}(\Sigma)/

Figures (4)

  • Figure 1: Dynamics of singular values (log scale) for an identity target matrix. From left to right, top to bottom: real initialization with $\det(U^\top V) = 1$, $\det(U^\top V) = -1$, and complex initialization.
  • Figure 2: Dynamics of singular values (log scale) for a non-identity target matrix. From left to right, top to bottom: real initialization with $\det(U^\top V) = 1$, $\det(U^\top V) = -1$, and complex initialization.
  • Figure 3: Dynamics of extreme singular values (log scale) for four weight matrices.
  • Figure 4: Dynamics of the minimum singular value of the Hermitian main term $W_1 + W_2^{-1} W_3^H W_4^H$ (log scale). From left to right, top to bottom: real initialization with $\det(W) > 0$, $\det(W) < 0$, and complex initialization.
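The quantity tracked in Figure 4 can be computed for a given set of weights with a short NumPy sketch. Only the formula $W_1 + W_2^{-1} W_3^H W_4^H$ comes from the caption; the function names `hermitian_main_term` and `min_singular_value` are hypothetical, and the sketch assumes square weight matrices with $W_2$ invertible.

```python
import numpy as np


def hermitian_main_term(W1, W2, W3, W4):
    """Return W1 + W2^{-1} W3^H W4^H, assuming W2 is square and invertible."""
    rhs = W3.conj().T @ W4.conj().T
    # Solve W2 X = rhs rather than forming the explicit inverse of W2.
    return W1 + np.linalg.solve(W2, rhs)


def min_singular_value(M):
    """Smallest singular value of M (np.linalg.svd returns them in descending order)."""
    return np.linalg.svd(M, compute_uv=False)[-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 3
    # Complex Gaussian matrices, used here only as sample inputs.
    Ws = [rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for _ in range(4)]
    print(min_singular_value(hermitian_main_term(*Ws)))
```

Calling these two functions on the weights at each gradient descent iteration and plotting the result on a log scale would give a curve of the kind the caption describes.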

Theorems & Definitions (123)

  • Theorem 1: Main theorem, informal
  • Remark 1
  • Remark 2
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Remark 3
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • ...and 113 more