Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient

Puya Latafat, Andreas Themelis, Lorenzo Stella, Panagiotis Patrinos

TL;DR

This work addresses convex composite optimization in which the smooth term has a locally Lipschitz gradient, and introduces adaptive, linesearch-free first-order methods for it. The authors develop adaPG, an adaptive proximal gradient method, and extend it to adaPD, a three-term adaptive primal-dual scheme, together with an "essentially" fully adaptive variant, adaPD$^+$, which uses backtracking to avoid computing the norm of the linear operator. The core idea is to combine tight local estimates of cocoercivity and Lipschitz continuity, captured by the quantities $\ell_k$ and $c_k$, so that stepsizes are updated without any function-value evaluations; convergence guarantees and sublinear rates are provided. Numerical experiments on logistic regression, cubic regularization, regularized least squares, dual SVM, LAD, and square-root lasso compare the proposed methods against linesearch-based ones and report performance gains on these problems.

Abstract

Backtracking linesearch is the de facto approach for minimizing continuously differentiable functions with locally Lipschitz gradient. In recent years, it has been shown that in the convex setting it is possible to avoid linesearch altogether, and to allow the stepsize to adapt based on a local smoothness estimate without any backtracks or evaluations of the function value. In this work we propose an adaptive proximal gradient method, adaPG, that uses novel estimates of the local smoothness modulus which leads to less conservative stepsize updates and that can additionally cope with nonsmooth terms. This idea is extended to the primal-dual setting where an adaptive three-term primal-dual algorithm, adaPD, is proposed which can be viewed as an extension of the PDHG method. Moreover, in this setting the "essentially" fully adaptive variant adaPD$^+$ is proposed that avoids evaluating the linear operator norm by invoking a backtracking procedure, that, remarkably, does not require extra gradient evaluations. Numerical simulations demonstrate the effectiveness of the proposed algorithms compared to the state of the art.
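
To make the idea of a linesearch-free adaptive stepsize concrete, here is a minimal Python sketch. It is not adaPG: the stepsize rule is the simpler one from the adaptive gradient descent method of Malitsky and Mishchenko (the prior work listed below as malitsky2020adaptive), combined with a proximal step as an assumption of this sketch. The function names, the initial stepsize `gamma0`, and the toy lasso data are all hypothetical.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def adaptive_prox_grad(grad_f, prox_g, x0, gamma0=1e-3, iters=2000):
    """Proximal gradient with a linesearch-free adaptive stepsize.

    Assumption: the stepsize rule is the Malitsky-Mishchenko one,
    NOT the adaPG rule of the paper.
    grad_f(x) returns the gradient of the smooth term f;
    prox_g(v, gamma) returns prox_{gamma * g}(v).
    """
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_f(x_prev)
    gamma_prev = gamma0
    x = prox_g(x_prev - gamma_prev * g_prev, gamma_prev)
    theta = np.inf  # ratio gamma_k / gamma_{k-1}; infinite at the start
    for _ in range(iters):
        g = grad_f(x)
        dx = np.linalg.norm(x - x_prev)
        dg = np.linalg.norm(g - g_prev)
        if dx == 0.0:
            break  # the iterate is a fixed point of the proximal gradient map
        # Two bounds: a cap on stepsize growth, and half the inverse of the
        # local Lipschitz estimate dg / dx (no function values are needed).
        growth = np.sqrt(1.0 + theta) * gamma_prev
        local = dx / (2.0 * dg) if dg > 0.0 else np.inf
        gamma = min(growth, local)
        if not np.isfinite(gamma):
            gamma = gamma_prev  # both bounds vacuous: keep the old stepsize
        x_prev, g_prev = x, g
        x = prox_g(x - gamma * g, gamma)
        theta, gamma_prev = gamma / gamma_prev, gamma
    return x


# Toy lasso problem (hypothetical data): 0.5 * ||A x - b||^2 + lam * ||x||_1
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0])
lam = 0.5
x_hat = adaptive_prox_grad(
    grad_f=lambda x: A.T @ (A @ x - b),
    prox_g=lambda v, gamma: soft_threshold(v, gamma * lam),
    x0=np.zeros(2),
)
```

The toy problem is separable, so its minimizer can be computed by hand as $(0.875, 0.5)$, which gives a reference point for checking the iterates. Only gradients and iterate differences enter the stepsize update, which is the feature the abstract emphasizes; the estimates used by adaPG itself are described in the paper as less conservative.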

Paper Structure

This paper contains 28 sections, 8 theorems, 80 equations, 4 figures, 1 table, 3 algorithms.

Key Result

lemma 1

Suppose that the assumption labeled (ass:PG) holds, and let $x^{k-1},x^k\in\mathbb{R}^n$. Then, with $L_k$, $\ell_k$, and $c_k$ as in (eq:MML_k) and (eq:CL), the following hold:

Figures (4)

  • Figure 1: Simulations for problem (eq:PG).
  • Figure 2: Simulations for the dual SVM problem (eq:dsvm). First row: $C = 1$, second row: $C = 0.1$. The $x$-axis reports gradient evaluations, which is the most expensive operation (since $A\in \mathbb{R}^{1\times N}$, calls to $A$ and $A^\top$ are negligible). As explained, in this case Algorithm (alg:PDls) is indistinguishable from Algorithm (alg:PD) and thus omitted. Algorithm (alg:PD) and MP-ls are tuned for best performance by a grid search for $t\in[0.01, 100]$.
  • Figure 3: Simulations for problem (eq:medreg) of Section (sec:MDreg) with $\lambda = 10$. First row: regularized least-absolute deviation ($p=1$); second row: square-root lasso ($p=2$). AdaPDM+ and MP-ls are tuned for best performance by a grid search for $t\in[0.01, 100]$. Having $f = 0$, adaPDM reduces to PDHG with worse (constant) stepsizes and is thus omitted from the comparisons.
  • Figure 4: Demonstrative plot of stepsize magnitudes in windows of 50 and 100 iterations extracted from the simulations in this section. First row, left: logistic regression, mushroom dataset; right: lasso problem with $m=500$, $n=1000$, $n_\star = 100$. As commented after Remark (rem:PG:Malitsky), despite the fact that the stepsize update rule of adaPGM is less conservative than that of adaPGM-MM, the comparison does not carry over iterationwise. Second row, left: dual SVM, mushroom dataset, $C=0.1$; right: square-root lasso (housing dataset, $\lambda=10$). Primal-dual algorithms are compared on problems where the ratio $t^2=\sigma_k/\gamma_k$ coincides, so that the plots are representative also for dual stepsizes $\sigma_k$.

Theorems & Definitions (17)

  • lemma 1
  • proof
  • lemma 2
  • proof
  • theorem 1
  • proof
  • remark 1: Comparison with [malitsky2020adaptive]
  • corollary 1: Proximal extension of [malitsky2020adaptive]
  • proof
  • remark 2: Alternative stepsize choices
  • ...and 7 more