Adaptive proximal gradient methods are universal without approximation

Konstantinos A. Oikonomidis; Emanuel Laude; Puya Latafat; Andreas Themelis; Panagiotis Patrinos

TL;DR

This work tackles composite convex minimization in which the gradient of the smooth term is only locally Hölder continuous, so that no global Lipschitz constant is available. It introduces linesearch-free adaptive proximal gradient methods that are universal with respect to the Hölder order $\nu$. The AdaPG family uses local Hölder estimates and scaled stepsizes to achieve descent without approximation and without knowledge of $\nu$. The authors prove full-sequence convergence and an explicit rate $O((K+1)^{-\nu})$ for $\nu \in (0,1]$, in a setting that covers semi-algebraic $C^1$ functions. A unified update rule recovers several prior adaptive schemes as special cases, and practical aspects include initialization refinements and theoretical bounds on the stepsizes. Numerical results show competitive performance on both locally and globally Hölder-smooth problems, underscoring the approach's robustness when classical Lipschitz assumptions fail.
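
For orientation, the standard notion of local Hölder continuity of the gradient can be written out as below. The symbols $f$, $U$ and $L_U$ are notation assumed here for illustration and are not taken from the paper. For every compact convex set $U$ there is a constant $L_U > 0$ such that

$$
\|\nabla f(x) - \nabla f(y)\| \le L_U \, \|x - y\|^{\nu} \qquad \text{for all } x, y \in U,
$$

which in turn gives the upper bound

$$
f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L_U}{1+\nu} \, \|y - x\|^{1+\nu} \qquad \text{for all } x, y \in U.
$$

With $\nu = 1$ this is the usual local Lipschitz setting. A simple one-dimensional example with $\nu < 1$ is $f(x) = \tfrac{1}{1+\nu}|x|^{1+\nu}$, whose derivative $\operatorname{sign}(x)\,|x|^{\nu}$ is Hölder continuous of order $\nu$ but not Lipschitz continuous around the origin.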

Abstract

We show that adaptive proximal gradient methods for convex problems are not restricted to traditional Lipschitzian assumptions. Our analysis reveals that a class of linesearch-free methods is still convergent under mere local Hölder gradient continuity, covering in particular continuously differentiable semi-algebraic functions. To mitigate the lack of local Lipschitz continuity, popular approaches revolve around $\varepsilon$-oracles and/or linesearch procedures. In contrast, we exploit plain Hölder inequalities not entailing any approximation, all while retaining the linesearch-free nature of adaptive schemes. Furthermore, we prove full sequence convergence without prior knowledge of local Hölder constants nor of the order of Hölder continuity. Numerical experiments make comparisons with baseline methods on diverse tasks from machine learning covering both the locally and the globally Hölder setting.
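
To make the phrase "linesearch-free adaptive scheme" concrete, here is a minimal Python sketch of a proximal gradient loop whose stepsize is set from quantities already computed at the last two iterates. It is not the paper's AdaPG update, which is not reproduced on this page. As a stand-in it uses the well-known Malitsky–Mishchenko-style rule based on the local estimate $\|x^k - x^{k-1}\| / \|\nabla f(x^k) - \nabla f(x^{k-1})\|$. The names `adaptive_prox_grad`, `soft_threshold`, `grad_f`, `prox_g` and `gamma0` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal map of t * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def adaptive_prox_grad(grad_f, prox_g, x0, gamma0=1e-3, iters=500):
    """Linesearch-free adaptive proximal gradient sketch.

    grad_f(x)        -> gradient of the smooth term at x
    prox_g(v, gamma) -> proximal map of gamma * g at v
    The stepsize rule is an assumed stand-in (Malitsky-Mishchenko style),
    not the AdaPG rule analysed in the paper.
    """
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_f(x_prev)
    gamma_prev = gamma = gamma0
    x = prox_g(x_prev - gamma * g_prev, gamma)
    for _ in range(iters):
        g = grad_f(x)
        dx = np.linalg.norm(x - x_prev)
        if dx == 0.0:
            break  # the iterate did not move: x is a fixed point
        dg = np.linalg.norm(g - g_prev)
        # Cap on how fast the stepsize may grow between iterations.
        growth = gamma * np.sqrt(1.0 + gamma / gamma_prev)
        # Half the inverse of the local Lipschitz estimate dg / dx.
        local = dx / (2.0 * dg) if dg > 0.0 else np.inf
        gamma_next = min(growth, local)
        x_prev, g_prev = x, g
        gamma_prev, gamma = gamma, gamma_next
        x = prox_g(x - gamma * g, gamma)
    return x
```

For a lasso-type problem with data `A`, `b` and weight `lam`, one would pass `grad_f = lambda x: A.T @ (A @ x - b)` and `prox_g = lambda v, t: soft_threshold(v, t * lam)`. No Lipschitz or Hölder constant is supplied and no backtracking loop is run, which is the feature the abstract emphasises.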

Paper Structure (6 sections, 6 theorems, 26 equations)

Introduction
Universal, adaptive, without approximation
Hölder continuity estimates
AdaPG revisited
Preliminary lemmas
Convergence and rates

Key Result

Lemma 2.3

Let $\ell_{k,\nu}$ and $L_{k,\nu}$ be as in (eq:lL) for some $x^{k-1}, x^k \in \mathbb{R}^n$, and let $H_k$ be as in (eq:Hk) for some $\gamma_k > 0$. Then, for any $\nu \in [0,1]$ and with $\lambda_{k,\nu}$ as in (eq:lamk), it holds that

Theorems & Definitions (7)

Lemma 2.3
Lemma 3.1: FNE-like inequality
Corollary 3.2
Lemma 3.3: main inequality
Lemma 3.4: basic properties of AdaPG
Remark 3.5
Lemma 3.6
