The Essential Best and Average Rate of Convergence of the Exact Line Search Gradient Descent Method

Thomas Yu

TL;DR

This work studies the convergence of the exact line search gradient descent (GD) method on strongly convex quadratic objectives and settles a long-standing question about its average and essential best-case rates of convergence (ROC) in the ill-conditioned regime. By recasting the iteration as a discrete dynamical system via Akaike's map $T$ and applying the theorem of center and stable manifolds, the author characterizes the behavior of the average ROC as the condition number of the Hessian grows: it tends to zero when the Hessian has only two distinct eigenvalues, and it approaches the worst-case ROC when the Hessian has an intermediate eigenvalue bounded away from the two extremal ones. The 2-D analysis gives explicit intuition for why the average ROC becomes arbitrarily fast in the absence of an intermediate eigenvalue, while in higher dimensions the essential best-case ROC depends on the intermediate part of the spectrum. The paper also discusses, in passing, exact line search GD for well-conditioned polynomial optimization problems (POPs) such as phase retrieval, where a tailored exact line search costs only 50\% more per iteration than constant step size GD while matching the ROC of the optimally tuned constant step size. Overall, the paper connects classical ROC bounds for exact line search GD with contemporary applications in imaging and data sciences.

Abstract

It is very well known that when the exact line search gradient descent method is applied to a convex quadratic objective, the worst-case rate of convergence (ROC), among all seed vectors, deteriorates as the condition number of the Hessian of the objective grows. By an elegant analysis due to H. Akaike, it is generally believed -- but not proved -- that in the ill-conditioned regime the ROC for almost all initial vectors, and hence also the average ROC, is close to the worst case ROC. We complete Akaike's analysis by determining the \emph{essential best case ROC} (defined in a measure-theoretic way) by using a dynamical system approach, facilitated by the theorem of center and stable manifolds. Our analysis also makes apparent the effect of an intermediate eigenvalue in the Hessian by establishing the following amusing result: In the absence of an intermediate eigenvalue, the average ROC gets arbitrarily \emph{fast} -- not slow -- as the Hessian gets increasingly ill-conditioned. We discuss in passing some contemporary applications of exact line search GD to well-conditioned polynomial optimization problems arising from imaging and data sciences. In particular, we observe that a tailored exact line search GD algorithm for a POP arising from the phase retrieval problem is only 50\% more expensive per iteration than its constant step size counterpart, while promising a ROC only matched by the optimally tuned (constant) step size which can rarely be achieved in practice.
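As a minimal sketch of the method the abstract refers to (not the paper's own code): for $f(x)=\tfrac12 x^TAx$ with $A$ symmetric positive definite, the exact line search step along the negative gradient $g=Ax$ is $\alpha = g^Tg/(g^TAg)$. The function name `exact_line_search_gd`, the test matrix, and the seed vector below are illustrative assumptions. The seed $(\kappa,1)$ for $A=\mathrm{diag}(1,\kappa)$ is the classical worst case, for which the energy norm $\sqrt{x^TAx}$ contracts by $(\kappa-1)/(\kappa+1)$ at every step.

```python
import numpy as np

def exact_line_search_gd(A, x0, num_iters=50):
    """Exact line search GD on f(x) = 0.5 * x^T A x (A symmetric positive definite).

    Returns the final iterate and the energy norms sqrt(x^T A x) of all iterates.
    Hypothetical helper written for illustration only.
    """
    x = np.asarray(x0, dtype=float).copy()
    energy = [np.sqrt(x @ A @ x)]
    for _ in range(num_iters):
        g = A @ x                    # gradient of f at x
        gAg = g @ A @ g
        if gAg == 0.0:               # gradient vanished: already at the minimizer
            break
        alpha = (g @ g) / gAg        # minimizes f(x - alpha * g) over alpha
        x = x - alpha * g
        energy.append(np.sqrt(x @ A @ x))
    return x, np.array(energy)

# Assumed test problem: 2-D diagonal Hessian with condition number kappa.
kappa = 10.0
A = np.diag([1.0, kappa])
x_final, energy = exact_line_search_gd(A, [kappa, 1.0])

per_step = energy[1:] / energy[:-1]           # per-iteration contraction factors
worst_case = (kappa - 1.0) / (kappa + 1.0)    # classical worst-case rate
print(per_step[:3], worst_case)
```

For this particular seed every entry of `per_step` agrees with `worst_case` up to rounding; other seeds generally contract faster, which is what the average and essential best-case ROC quantify.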

Paper Structure (13 sections, 12 theorems, 67 equations, 5 figures, 1 algorithm)

Key Result

Theorem 1.2

(i) If $A$ has only two distinct eigenvalues, then the average ROC approaches 0 when ${\rm cond}(A) \rightarrow \infty$. (ii) If $A$ has an intermediate eigenvalue $\lambda_i$ uniformly bounded away from the two extremal eigenvalues, then the average ROC approaches the worst case ROC in eq:AverageWo
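The dichotomy in Theorem 1.2 can be probed numerically. The following is a rough sketch under stated assumptions, not the paper's experiment: the ROC of a single run is estimated as the geometric mean of the per-step energy-norm contraction factors after a burn-in (the paper's Definition 1.1 is not reproduced on this page, so this estimator is an assumption), the Hessian is taken diagonal without loss of generality, and seed vectors are drawn uniformly from the unit sphere via normalized Gaussians. The names `empirical_roc` and `average_roc`, the iteration counts, and the spectra are all illustrative choices.

```python
import numpy as np

def empirical_roc(eigs, x0, num_iters=400, burn_in=200):
    """Geometric mean of per-step energy-norm contraction factors after burn-in.

    The iterate is rescaled to unit energy norm after every step; the exact
    line search iteration is scale invariant, so this only avoids underflow.
    """
    lam = np.asarray(eigs, dtype=float)
    x = np.asarray(x0, dtype=float)
    x = x / np.sqrt(np.sum(lam * x * x))
    logs = []
    for _ in range(num_iters):
        g = lam * x                              # gradient for a diagonal Hessian
        alpha = (g @ g) / (g @ (lam * g))        # exact line search step
        x = x - alpha * g
        ratio = np.sqrt(np.sum(lam * x * x))     # contraction factor of this step
        if ratio == 0.0:                         # converged exactly in finite steps
            return 0.0
        logs.append(np.log(ratio))
        x = x / ratio
    return float(np.exp(np.mean(logs[burn_in:])))

def average_roc(eigs, num_samples=100, seed=0):
    """Monte Carlo average of empirical_roc over random seed vectors."""
    rng = np.random.default_rng(seed)
    starts = rng.standard_normal((num_samples, len(eigs)))
    return float(np.mean([empirical_roc(eigs, x0) for x0 in starts]))

kappa = 1.0e4
worst_case = (kappa - 1.0) / (kappa + 1.0)
avg_two = average_roc([1.0, kappa])                  # two distinct eigenvalues
avg_three = average_roc([1.0, kappa / 2.0, kappa])   # one intermediate eigenvalue
print(avg_two, avg_three, worst_case)
```

With these assumed settings, one would expect `avg_two` to sit far below `worst_case` and `avg_three` to sit close to it, in line with parts (i) and (ii); the exact numbers depend on the sample size, seed, and burn-in.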

Figures (5)

  • Figure 1: ROC of constant step size GD vs optimum GD for the phase retrieval problem
  • Figure 2: In dimension $n=3$, every point on the boundary of $\Delta_3\backslash \{e_1,e_2,e_3\}$ is a fixed point of $T^2=T\circ T$. Every point in the interior of $\Delta_3$ is attracted by $T^2$ to some point in the blue line segment. The analysis is done by writing $T$ as a map on the projected simplex of $\Delta$ on the $p_1$-$p_2$ plane, colored in light purple; see \ref{eq:Remove1dof}.
  • Figure 3: Left: Plots of $\theta$ versus $\rho^\ast([\cos(\theta),\sin(\theta)]^T,[1,a]^T)$ for various values of $a$. Observe the convergence in \ref{eq:NonUniform} being non-uniform in $\theta$. Right: The worst, average and the square root of the average square rate of convergence as a function of $a$. The average rate of convergence is computed using numerical integration, while the other two curves are given by the closed-form expressions \ref{eq:Worst2D} and \ref{eq:closed}.
  • Figure 4: Distribution of the limit angle $\theta$, estimated from $10^7$ initial vectors $x^{(0)}$ sampled from the uniform distribution on the unit 2-sphere in 3-D. The distribution is supported on a sub-interval $J$ of $[0,\pi/2]$, where $J$ is defined by \ref{eq:Interval_J}. The horizontal lines show the essential best ROC in \ref{eq:LowerBound}. The left vertical axis is for ROC, while the right vertical axis is for probability density. The black dot corresponds to the angle $\tan^{-1}(a^{-1})$, which yields the slowest ROC $(1-a)/(1+a)$.
  • Figure 5: ROC of the optimum GD method applied to the Rosenbrock function with $n$ variables with 500 initial guesses uniformly sampled from the unit ball around the unique minimizer. The black solid and black dashed lines illustrate the worst case and essential best ROCs, respectively, assuming that the objective were a quadratic with Hessian $\nabla^2 f(x^\ast)$.

Theorems & Definitions (15)

  • Definition 1.1
  • Theorem 1.2
  • Theorem 1.3
  • Remark 1.4
  • Remark 1.5
  • Proposition 2.1
  • Proposition 2.2: Independence of $\lambda_1$ and $\lambda_n$
  • Theorem 2.3
  • Proposition 3.1
  • Lemma 4.1
  • ...and 5 more