Acceleration for Polyak-Łojasiewicz Functions with a Gradient Aiming Condition

Julien Hermant

TL;DR

The paper investigates when momentum acceleration improves convergence for Polyak-Łojasiewicz (PL) functions, highlighting that PL alone does not guarantee acceleration and that strong quasar-convexity can be insufficient. It introduces the gradient aiming condition AC^a, which quantifies alignment between the descent direction and the minimizer, and proves accelerated convergence for gradient methods under AC^a when the alignment is large, with explicit continuous-time and discrete-time bounds. It further relaxes AC^a to an average-aiming condition along the optimization path, showing that acceleration can persist on average even if AC^a fails globally. Through a 2D PL counterexample and neural-network experiments, the work clarifies when momentum helps or hinders and provides practical guidance for designing accelerated first-order methods in nonconvex settings.

Abstract

It is known that when minimizing smooth Polyak-Łojasiewicz (PL) functions, momentum algorithms cannot significantly improve the convergence bound of gradient descent, contrasting with the acceleration phenomenon occurring in the strongly convex case. To bridge this gap, the literature has proposed strongly quasar-convex functions as an intermediate non-convex class, for which accelerated bounds have been suggested to persist. We show that this is not true in general: the additional structure of strong quasar-convexity does not suffice to guarantee better worst-case bounds for momentum compared to gradient descent. As an alternative, we study PL functions under an aiming condition that measures how well the descent direction points toward a minimizer. This perspective clarifies the geometric ingredient enabling provable acceleration by momentum when minimizing PL functions.
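
As a point of comparison for the strongly convex case mentioned in the abstract, the following is a minimal sketch, not taken from the paper, that runs gradient descent and a constant-momentum Nesterov scheme on a diagonal quadratic. The test function, the eigenvalues `eigs`, the starting point, the iteration count, and the helper names `run_gd`, `run_nesterov`, and `quad` are all illustrative assumptions. The step size $1/L$ and the momentum $\beta = (\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})$ are the textbook choices for a $\mu$-strongly convex, $L$-smooth function.

```python
import math


def quad(eigs, x):
    """Evaluate f(x) = 0.5 * sum_i eigs[i] * x[i]^2 (illustrative test function)."""
    return 0.5 * sum(lam * xi * xi for lam, xi in zip(eigs, x))


def run_gd(eigs, x0, steps):
    """Gradient descent with step 1/L, where L = max(eigs)."""
    L = max(eigs)
    x = list(x0)
    for _ in range(steps):
        # The gradient of the diagonal quadratic is (eigs[i] * x[i])_i.
        x = [xi - (lam * xi) / L for lam, xi in zip(eigs, x)]
    return x


def run_nesterov(eigs, x0, steps):
    """Nesterov momentum with the constant beta used for strongly convex f."""
    mu, L = min(eigs), max(eigs)
    beta = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        # Extrapolate, then take a gradient step from the extrapolated point.
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
        x = [yi - (lam * yi) / L for lam, yi in zip(eigs, y)]
    return x


eigs = [0.01, 1.0]   # assumed spectrum: mu = 0.01, L = 1
x0 = [1.0, 1.0]      # assumed starting point
f_gd = quad(eigs, run_gd(eigs, x0, 200))
f_nm = quad(eigs, run_nesterov(eigs, x0, 200))
print(f"GD: {f_gd:.3e}  momentum: {f_nm:.3e}")
```

On this ill-conditioned quadratic the momentum iterate contracts at a rate governed by $\sqrt{\mu/L}$ rather than $\mu/L$, which is the acceleration that, according to the abstract, PL alone does not guarantee.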

Paper Structure

This paper contains 42 sections, 23 theorems, 232 equations, 11 figures, 1 table.

Key Result

Theorem 4.2

Let $(x_t)_{t\ge 0} \sim$ \ref{eq:gf}. (i) If $f \in \text{PL}^\mu$, then [...]. (ii) If $f \in \text{PL}^\mu \cap \text{AC}^{a} \cap \text{QG}_{+}^{L_0}$, then [...], where $\mu_0 := \sup\{\mu' \ge\mu : f \in \text{QG}_{-}^{\mu'}\}$.
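
For orientation, the classical continuous-time estimate under PL can be derived in one line. This is a sketch under assumptions not confirmed by the text above: that \ref{eq:gf} denotes the gradient flow $\dot{x}_t = -\nabla f(x_t)$, that $f^* = \min f$, and that PL is normalized as $\|\nabla f(x)\|^2 \ge 2\mu\,(f(x) - f^*)$. The constants in the paper's own statement may differ.

```latex
\frac{d}{dt}\bigl(f(x_t) - f^*\bigr)
  = \langle \nabla f(x_t), \dot{x}_t \rangle
  = -\|\nabla f(x_t)\|^2
  \le -2\mu\,\bigl(f(x_t) - f^*\bigr)
\quad\Longrightarrow\quad
f(x_t) - f^* \le e^{-2\mu t}\,\bigl(f(x_0) - f^*\bigr).
```

The implication is Grönwall's inequality applied to $t \mapsto f(x_t) - f^*$.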

Figures (11)

  • Figure 1: Left: graph of $f(t) = 5(t+0.19\sin(5t))^2$, minimizer is $0$. Right: graph of $f(x,y) = 0.5(0.5x^2 -y)^2 + 0.05x^2$, minimizer is $(0,0)$. Bottom: table giving for each function the numerical value of the theoretical convergence rate for \ref{eq:gd} or \ref{eq:nm}, specifying which class of functions is used to characterize the bound (see Table \ref{table:sc_sqc_rates}), where $\Lambda$ is the \ref{cr}. See implementation details in Appendix \ref{app:param_details}. It shows that, depending on the function, $\text{PL}^\mu$-based bounds may be sharper than $\text{SQC}_{\tau}^{\mu}$-based bounds, and conversely.
  • Figure 2: For a range of parameters $\tau$, we compute the highest admissible $\mu$ ensuring $f \in \text{SQC}_{\tau}^{\mu}$, and plot the numerical values of the associated theoretical convergence rates for \ref{eq:gd} and \ref{eq:nm} under $\text{SQC}_{\tau}^{\mu} \cap \text{LS}^{L}$, namely $\tau \mu/L$ and $\tau \sqrt{\mu/L}$, with $f(t) = 5(t+0.19\sin(5t))^2$. See implementation details in Appendix \ref{app:param_details}. Note that the pair $(\tau, \mu)$ that maximizes the convergence rate of \ref{eq:gd} differs from the pair that maximizes that of \ref{eq:nm}.
  • Figure 3: Left: Heatmap of $F_{0.001}(x,y) = 0.5(y-\sin(x))^2 + 0.001\cdot0.5x^2$. Blue arrows indicate the descent direction $-\nabla F_{0.001}$. Center: First $1000$ iterations of the trajectories of \ref{eq:gd} and \ref{eq:nm_prime}, both starting from the initialization point $(0,3)$, for different values of the momentum parameter $\alpha$. Top right: Corresponding decrease of $\log(f)$. Bottom right: Values of the aiming condition along the iterates, zoomed in on the first 100 iterations. Early negative aiming condition values cause momentum to drive the trajectory away from the minimizer, leading \ref{eq:gd} to outperform momentum in early iterations, with increasing effect for larger $\alpha$.
  • Figure 4: Visualization of functions belonging to $\text{SQC}_{\tau}^{\mu}$, for some parameters. The figure is borrowed from hermant2025continuized, see details in Appendix \ref{app:conceptual_figures}.
  • Figure 5: The grey curve is $f(t) = 0.5 \cdot 5(t+0.07\sin(13t))^2$. The blue (red) curve is a quadratic upper (lower) bound. We compute the largest $\mu$ and smallest $L$ such that $f \in \text{PL}^\mu \cap \text{LS}^{L}$ on the displayed domain, numerically evaluated at $\mu \approx 4\cdot10^{-2}$, $L \approx 6\cdot 10^2$. Also, the parameters $L_0$ and $\mu_0$ that parameterize the quadratic bounds are $\mu_0 \approx 3$ and $L_0 \approx 18$. On this prototype 1-dimensional example, $\frac{\mu_0}{L_0} \approx 0.2$ while $\frac{\mu}{L} \approx 7\cdot 10^{-5}$. This highlights the possibility of a significant gap between these two ratios.
  • ...and 6 more figures
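
The quantity tracked in the bottom-right panel of Figure 3 can be approximated with a short script. The sketch below is a hedged illustration, not the paper's implementation: it uses the function $F_{0.001}$, the minimizer $(0,0)$, and the initialization $(0,3)$ from the caption, and it measures alignment as the cosine between the descent direction $-\nabla F_{0.001}(x)$ and the direction $x^* - x$ toward the minimizer. That cosine is an assumed proxy for the aiming condition, whose exact definition is not given here. The step size `0.1`, the iteration count, and the helper names are also assumptions.

```python
import math

EPS = 0.001            # regularization weight from the Figure 3 caption
X_STAR = (0.0, 0.0)    # minimizer of F_{0.001}


def F(x, y):
    """F_{0.001}(x, y) = 0.5 * (y - sin x)^2 + 0.001 * 0.5 * x^2."""
    return 0.5 * (y - math.sin(x)) ** 2 + EPS * 0.5 * x ** 2


def grad_F(x, y):
    """Gradient of F_{0.001}, computed by hand."""
    r = y - math.sin(x)
    return (-r * math.cos(x) + EPS * x, r)


def alignment(x, y):
    """Cosine between -grad F(x, y) and the direction toward X_STAR (assumed proxy)."""
    gx, gy = grad_F(x, y)
    dx, dy = X_STAR[0] - x, X_STAR[1] - y
    g_norm = math.hypot(gx, gy)
    d_norm = math.hypot(dx, dy)
    if g_norm == 0.0 or d_norm == 0.0:
        return 1.0  # convention at a stationary point or at the minimizer
    return (-gx * dx - gy * dy) / (g_norm * d_norm)


def gd_alignments(start, step, n_iters):
    """Run gradient descent and record the alignment at each iterate."""
    x, y = start
    values = []
    for _ in range(n_iters):
        values.append(alignment(x, y))
        gx, gy = grad_F(x, y)
        x, y = x - step * gx, y - step * gy
    return (x, y), values


end, cosines = gd_alignments((0.0, 3.0), 0.1, 100)
print(f"F at start: {F(0.0, 3.0):.4f}, F after 100 steps: {F(*end):.6f}")
print(f"smallest alignment over the run: {min(cosines):.4f}")
```

A negative entry in `cosines` marks an iterate where the descent direction points away from the minimizer, which is the regime the caption associates with momentum hurting in early iterations.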

Theorems & Definitions (40)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Definition 2.5
  • Definition 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 5.2
  • ...and 30 more