Eigen-componentwise convergence of SGD on quadratic programming
Lehan Chen, Yuji Nakatsukasa
TL;DR
The paper analyzes how SGD converges along the eigen-directions of the Hessian for linear least-squares (LS) problems, revealing that large-eigenvalue components tend to contract faster in early iterations and that a phase transition can slow component-wise convergence as iterations proceed. It derives explicit bounds for both the mean and the second moment of each component under fixed and decaying step sizes, extending Steinerberger's insights from randomized Kaczmarz to general SGD and to inconsistent LS, where convergence to the exact solution requires diminishing step sizes. The results cover three step-size regimes (fixed, 1/k, and 1/k^\gamma with 1/2<\gamma<1) and show how spectral properties, conditioning, and step-size choices jointly shape both the initial fast decay of the dominant components and the eventual slower asymptotic behavior due to variance. Practically, the findings inform step-size selection and the interpretation of SGD dynamics in high-dimensional LS problems, clarifying when and why early iterations are most informative and how phase transitions manifest in the error decomposition.
Abstract
Stochastic gradient descent (SGD) is a workhorse algorithm for solving large-scale optimization problems in data science and machine learning. Understanding the convergence of SGD is hence of fundamental importance. In this work we examine the convergence of SGD (with various step sizes) when applied to unconstrained convex quadratic programming (essentially least-squares (LS) problems), and in particular analyze the error components with respect to the eigenvectors of the Hessian. The main message is that the convergence depends largely on the corresponding eigenvalues (singular values of the coefficient matrix in the LS context), namely the components for the large singular values converge faster in the initial phase. We then show there is a phase transition in the convergence where the convergence speed of the components, especially those corresponding to the larger singular values, will decrease. Finally, we show that the convergence rate of the overall error (in the solution) tends to slow down as more iterations are run, that is, the initial convergence is faster than the asymptotic rate.
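The componentwise behavior described above can be illustrated with a small numerical sketch. It is not the authors' experiment: it assumes a consistent LS problem, uniform sampling of one row per step, a fixed step size, and the per-sample objective (a_i^T x - b_i)^2 / 2. The function name `sgd_eigen_components`, the problem sizes, and the singular values are all hypothetical choices made for illustration. The code tracks the empirical second moment of the error along each right singular vector of the coefficient matrix.

```python
import numpy as np


def sgd_eigen_components(n=200, d=5, sing_vals=(10.0, 5.0, 2.0, 1.0, 0.5),
                         n_steps=100, n_trials=50, seed=0):
    """Empirical second moments of SGD error components (illustrative sketch).

    Assumptions (not taken from the paper): consistent system b = A x_star,
    uniform row sampling, fixed step size, hypothetical sizes and spectrum.
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(sing_vals, dtype=float)

    # Build A = U diag(s) V^T with prescribed singular values.
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))  # n x d, orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))  # d x d, orthogonal
    A = (U * s) @ V.T

    x_star = rng.standard_normal(d)
    b = A @ x_star  # consistent right-hand side

    # Conservative fixed step: eta * ||a_i||^2 <= 0.5 for every row a_i.
    eta = 0.5 / np.max(np.sum(A**2, axis=1))

    second_moment = np.zeros((n_steps + 1, d))
    for _ in range(n_trials):
        # Start with unit error along every right singular vector.
        x = x_star + V @ np.ones(d)
        second_moment[0] += (V.T @ (x - x_star)) ** 2
        for k in range(1, n_steps + 1):
            i = rng.integers(n)  # sample one row uniformly
            x = x - eta * (A[i] @ x - b[i]) * A[i]  # stochastic gradient step
            second_moment[k] += (V.T @ (x - x_star)) ** 2

    return s, eta, second_moment / n_trials


if __name__ == "__main__":
    s, eta, m2 = sgd_eigen_components()
    for j, sj in enumerate(s):
        print(f"sigma = {sj:5.2f}: second moment after {m2.shape[0] - 1} steps = {m2[-1, j]:.3e}")
```

Under these assumptions, the expected error along the j-th singular vector contracts by a factor of 1 - eta * sigma_j^2 / n per step, so the components for the largest singular values shrink first. Plotting the columns of the returned array on a logarithmic scale is one way to look for the slowdown that the abstract calls a phase transition.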
