Revisiting the Polyak step size

Elad Hazan, Sham Kakade

TL;DR

The paper addresses parameter-free optimization. It shows that a simple Polyak step size $\eta_t = h_t / \|\nabla f(x_t)\|^2$, with $h_t = f(x_t) - f(x^*)$, achieves near-optimal convergence rates for gradient descent in all four standard regimes (general convex, β-smooth, α-strongly convex, and both β-smooth and α-strongly convex) without prior knowledge of the problem constants. It also introduces an adaptive variant that requires only a lower bound $\tilde{f}_0 \le f(x^*)$ and refines this bound as needed, maintaining essentially the same performance with a logarithmic overhead in the number of gradient updates. The key contributions are a unified analysis showing near-optimality of the exact Polyak step size in multiple regimes, and a practical adaptive scheme that removes the need to know $f(x^*)$ a priori. Together, these results give a parameter-free, scalable approach to gradient-based optimization with explicit theoretical guarantees.
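To make the step-size rule concrete, here is a minimal sketch of unconstrained gradient descent using the exact Polyak step size stated above. It is an illustration under assumptions, not the paper's algorithm listing: the function name `polyak_gd`, the best-iterate tracking, the stopping checks, and the quadratic test problem are all hypothetical choices made for this example, and the sketch assumes $f(x^*)$ is known (the adaptive variant is not shown).

```python
from typing import Callable, List, Sequence

Vector = List[float]


def polyak_gd(
    f: Callable[[Sequence[float]], float],
    grad: Callable[[Sequence[float]], Vector],
    x0: Sequence[float],
    f_star: float,
    steps: int = 100,
) -> Vector:
    """Gradient descent with eta_t = (f(x_t) - f_star) / ||grad f(x_t)||^2.

    Illustrative helper (not from the paper); returns the best iterate seen.
    """
    x = list(x0)
    best_x, best_val = list(x), f(x)
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        h = f(x) - f_star  # suboptimality gap h_t
        if g_norm_sq == 0.0 or h <= 0.0:
            break  # stationary point, or gap already closed numerically
        eta = h / g_norm_sq  # Polyak step size: no smoothness/convexity constants
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        val = f(x)
        if val < best_val:
            best_x, best_val = list(x), val
    return best_x


# Hypothetical test problem: f(x) = 0.5 * (x1^2 + 10 * x2^2), so f(x*) = 0.
def f(x: Sequence[float]) -> float:
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x: Sequence[float]) -> Vector:
    return [x[0], 10.0 * x[1]]


x_best = polyak_gd(f, grad, x0=[1.0, 1.0], f_star=0.0, steps=1000)
```

The only problem-specific input is `f_star`. No Lipschitz, smoothness, or strong-convexity constant appears anywhere in the loop, which is the sense in which the schedule is parameter-free.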

Abstract

This paper revisits the Polyak step size schedule for convex optimization problems, proving that a simple variant of it simultaneously attains near-optimal convergence rates for the gradient descent algorithm, for all ranges of strong convexity, smoothness, and Lipschitz parameters, without a priori knowledge of these parameters.

Paper Structure

This paper contains 6 sections, 6 theorems, 27 equations, 1 table, 3 algorithms.

Key Result

Lemma 1

The sequence of iterates produced by projected gradient descent (equation eq:gd) satisfies:

Theorems & Definitions (11)

  • Lemma 1
  • proof
  • Theorem 1
  • Theorem 2
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • proof
  • Lemma 4
  • ...and 1 more