Monitoring the Convergence Speed of PDHG to Find Better Primal and Dual Step Sizes

Olivier Fercoq

TL;DR

This work bases its method on a spectral radius estimation procedure and tries to minimize this spectral radius, which is directly related to the rate of convergence. For strongly convex quadratics, the resulting step size rule yields an algorithm as fast as inertial gradient descent.

Abstract

Primal-dual algorithms for solving convex-concave saddle point problems usually come with one or several step size parameters. Within the range where convergence is guaranteed, choosing the step size well can make the difference between a slow and a fast algorithm. A usual way to set step sizes adaptively is to ensure a fair balance between the amounts of change of the primal and dual variables. In this work, we show how to find even better step sizes for the primal-dual hybrid gradient (PDHG). Taking inspiration from quadratic problems, we base our method on a spectral radius estimation procedure and try to minimize this spectral radius, which is directly related to the rate of convergence. Building on power iterations, we produce spectral radius estimates that are always smaller than 1 and that also work in the case of conjugate principal eigenvalues. For strongly convex quadratics, we show that our step size rule yields an algorithm as fast as inertial gradient descent. Moreover, since our spectral radius estimates rely only on residual norms, our method can be readily adapted to more general convex-concave saddle point problems. In a second part, we extend these results to a randomized version of PDHG called PURE-CD. We design a statistical test to compare observed convergence rates and decide whether one step size is better than another. Numerical experiments on least squares, sparse SVM, TV-L1 denoising and TV-L2 denoising problems support our findings.
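To make the link between step sizes and the spectral radius concrete, here is a minimal sketch, not the paper's algorithm. It assumes the standard Chambolle-Pock form of PDHG applied to the quadratic saddle point problem $\min_x \max_y \frac{\mu_x}{2}\|x\|^2 + \langle Ax, y\rangle - \frac{\mu_y}{2}\|y\|^2$, for which one iteration is a linear map. The helper names (`make_A`, `pdhg_matrix`, `spectral_radius`), the problem size `n = 20`, the coupling $\sigma\tau\|A\|^2 = 0.95$ and the grid of $\tau$ values are all illustrative assumptions.

```python
import numpy as np

def make_A(n=20, eta=1e-3):
    # Hypothetical small instance: row i encodes A_i x = (1 + eta) * x_i - x_{i+1}.
    A = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    A[idx, idx] = 1.0 + eta
    A[idx, idx + 1] = -1.0
    return A

def pdhg_matrix(A, tau, sigma, mu_x, mu_y):
    # One PDHG step (assumed Chambolle-Pock form) on the quadratic problem:
    #   x+ = (x - tau * A^T y) / (1 + tau * mu_x)
    #   y+ = (y + sigma * A (2 x+ - x)) / (1 + sigma * mu_y)
    # Written as (x+, y+) = M (x, y).
    m, n = A.shape
    a = 1.0 / (1.0 + tau * mu_x)
    b = 1.0 / (1.0 + sigma * mu_y)
    top = np.hstack([a * np.eye(n), -a * tau * A.T])
    bottom = np.hstack([
        b * sigma * (2.0 * a - 1.0) * A,
        b * (np.eye(m) - 2.0 * a * sigma * tau * (A @ A.T)),
    ])
    return np.vstack([top, bottom])

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

A = make_A()
norm_A = np.linalg.norm(A, 2)        # largest singular value of A
gamma = 0.95                         # assumed value of sigma * tau * ||A||^2
mu_x, mu_y = 0.01, 0.1
taus = np.logspace(-2, 2, 41) / norm_A
rhos = np.array([
    spectral_radius(pdhg_matrix(A, t, gamma / (t * norm_A**2), mu_x, mu_y))
    for t in taus
])
best = int(np.argmin(rhos))          # crude 0th-order search over tau
print(f"best tau * ||A|| on the grid: {taus[best] * norm_A:.3g}")
print(f"spectral radius there: {rhos[best]:.4f}")
print(f"spectral radius at grid ends: {rhos[0]:.6f}, {rhos[-1]:.6f}")
```

The grid search needs the full iteration matrix and its eigenvalues, which is only practical on small quadratics. This is why estimates built from residual norms alone, as described in the abstract, are attractive for larger or non-quadratic problems.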

Paper Structure (29 sections, 7 theorems, 77 equations, 9 figures, 1 table, 8 algorithms)

Introduction
Problem, notation and primal-dual algorithm
Vũ-Condat algorithm
Latafat et al.'s Tri-PD algorithm
Linear convergence rate
Adaptive step sizes based on residual balance
Goldstein et al.'s adaptive step sizes
Generalization to the case $f_2 \neq 0$
Residual balance for Tri-PD
Quadratic case
Minimization of the spectral radius
Estimation of the spectral radius using the power method
Unique principal eigenvalue
Conjugate pair of principal eigenvalues
Goldstein warm-up
...and 14 more sections

Key Result

Proposition 1

Consider the toy problem where $f$ is $\mu_f$-strongly convex and $g^*$ is $\mu_{g^*}$-strongly convex, and suppose we are solving it with the Vũ-Condat algorithm with constant step sizes $\tau$ and $\sigma$ such that $\sigma \tau = \gamma < 1$. Whatever the value of $\Delta$, if $\tau > \sqrt{\frac{ \gamma\mu_{g^*}}{\mu_{f

Figures (9)

Figure 1: Comparison of norms for the computation of spectral radius estimates. The problem under consideration is a least squares problem $\min_x \max_y \frac{\mu_x}{2} \|x\|^2 + \langle Ax, y \rangle - \frac{\mu_y}{2} \|y\|^2$ where each row of $A$ is such that $A_i x = (1+\eta) x_i - x_{i+1}$. We took $\mu_x = 0.01$, $\mu_y = 0.1$ and $\eta = 0.001$. The step sizes are constant with $\tau = 10 / \|A\|$. Using the norm for which nonexpansiveness is guaranteed greatly reduces the amplitude of the oscillations.
Figure 2: Comparison of several estimates of the spectral radius. The true value is shown as a dotted red line, the instantaneous estimate $\|u_{k+1}\|_V/\|u_k\|_V$ as a solid green line, the long-run estimate $(\|u_{k}\|_V/\|u_{k-s}\|_V)^{1/s}$ as an orange dash-dotted line, and the estimate proposed in this paper, based on the study of cycles, as a blue dashed line. The problem under consideration is the same as for Figure 1. On the left plot, the step sizes are constant with $\tau = 10 / \|A\|$. The instantaneous estimate fails because it oscillates, and the oscillations are far from negligible. The long-run and cycle-based estimates behave similarly in this context. On the right plot, $\tau$ is modified online by monitoring the convergence rate, starting from $\tau_0 = 100 / \|A\|$, as explained later in the paper.
Figure 3: Zooms on two cases for rate estimation. When the matrix is changing, we may encounter either the case where the principal eigenvalue is unique (left plot) or the case where there is a conjugate pair of principal eigenvalues. In both cases, the cycle-based estimate is the most accurate and is able to take advantage of warm starts when we modify the step sizes. Hence a single cycle is enough to reach a precision that lets us discriminate between a better and a worse rate.
Figure 4: Comparison of adaptive step size algorithms on the toy quadratic problem of Figure 1, where we initialized $\tau_0 = \frac{0.01}{\|A\|}$. Left: distance to the saddle point. Right: value of $\tau_k$ at each iteration (same line colors). We chose an initial step size that is far from the optimal one, so the base algorithm is quite slow. Moreover, when trying to estimate the rate, we need many iterations before being able to discriminate between two slow rates. Residual balance quickly updates the step sizes to something reasonable, and we can see an actual decrease on the left plot. However, the algorithm is not able to deal with the oscillating behavior of the residuals: it quickly drains its updating budget and stalls. Combining both methods gives the solid green line. Residual balance gives a step size good enough to allow accurate rate estimates. After a few updates, we obtain a rate nearly as good as what can be obtained by directly optimizing the spectral radius using 0th-order optimization.
Figure 5: Comparison of various adaptive step size strategies for PDHG. Quadratic problem: $\min_x \frac{1}{2} \|Ax-b\|^2_2 + \frac{\lambda}{2n}\|x\|^2_2$ with $\lambda = 10^{-3}$ and $A$, $b$ given by the a1a dataset [chang2011libsvm]. We initialize the step size a factor of 1000 away from the optimal step sizes given in the section on the linear convergence rate. Dotted blue line: constant step sizes. Dashed orange line: the proposed adaptive algorithm, based on rate estimation using the residual norm, with $\alpha_0 = 0$. Dash-dotted red line: Goldstein et al.'s adaptive step sizes. Solid green line: the proposed adaptive algorithm combining Goldstein et al.'s step size adaptation with ours. Loosely dash-dotted purple line: restarted FISTA with optimal restart period.
...and 4 more figures
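The instantaneous and long-run estimates named in the caption of Figure 2 can be sketched on the toy problem of Figure 1 as follows. This is an illustration under stated assumptions, not the paper's cycle-based estimator: PDHG is taken in its Chambolle-Pock form, $u_k$ is taken to be the residual between consecutive iterates, $\sigma$ is set so that $\sigma\tau\|A\|^2 = 0.95$, and $\|\cdot\|_V$ is assumed to be the metric $\|u\|_V^2 = \|x\|^2/\tau + \|y\|^2/\sigma - 2\langle Ax, y\rangle$ in which this form of PDHG is nonexpansive. The helper names, the size `n = 20` and the window `s = 50` are hypothetical.

```python
import numpy as np

def make_A(n=20, eta=1e-3):
    # Row i encodes A_i x = (1 + eta) * x_i - x_{i+1} (size n is an assumption).
    A = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    A[idx, idx] = 1.0 + eta
    A[idx, idx + 1] = -1.0
    return A

def pdhg_step(x, y, A, tau, sigma, mu_x, mu_y):
    # Proximal steps for f(x) = mu_x/2 ||x||^2 and g*(y) = mu_y/2 ||y||^2.
    x_new = (x - tau * (A.T @ y)) / (1.0 + tau * mu_x)
    y_new = (y + sigma * (A @ (2.0 * x_new - x))) / (1.0 + sigma * mu_y)
    return x_new, y_new

def v_norm(dx, dy, A, tau, sigma):
    # Assumed metric: ||u||_V^2 = ||dx||^2/tau + ||dy||^2/sigma - 2 <A dx, dy>.
    sq = dx @ dx / tau + dy @ dy / sigma - 2.0 * (dy @ (A @ dx))
    return float(np.sqrt(max(sq, 0.0)))

def residual_norms(A, tau, sigma, mu_x, mu_y, n_iter, seed=0):
    # V-norms of the residuals u_k = z_{k+1} - z_k along a PDHG run.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    y = rng.standard_normal(A.shape[0])
    norms = []
    for _ in range(n_iter):
        x_new, y_new = pdhg_step(x, y, A, tau, sigma, mu_x, mu_y)
        norms.append(v_norm(x_new - x, y_new - y, A, tau, sigma))
        x, y = x_new, y_new
    return np.array(norms)

A = make_A()
norm_A = np.linalg.norm(A, 2)
tau = 10.0 / norm_A
sigma = 0.95 / (tau * norm_A**2)     # assumed coupling between tau and sigma
norms = residual_norms(A, tau, sigma, mu_x=0.01, mu_y=0.1, n_iter=600)

s = 50                                # window length for the long-run estimate
instantaneous = norms[1:] / norms[:-1]
long_run = (norms[s:] / norms[:-s]) ** (1.0 / s)
print(f"instantaneous estimate range: "
      f"[{instantaneous.min():.4f}, {instantaneous.max():.4f}]")
print(f"last long-run estimate: {long_run[-1]:.4f}")
```

Because the iteration is nonexpansive in the assumed metric, the instantaneous ratio never exceeds 1, but it can still oscillate from one iteration to the next. The long-run estimate is the geometric mean of `s` consecutive ratios, which smooths those oscillations at the cost of reacting more slowly when the step sizes change.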

Theorems & Definitions (14)

Proposition 1
proof
Proposition 2
proof
Lemma 1 (Lemma 1 in [tran2020adaptive])
Theorem 1
proof
Remark 1
Proposition 3
proof
...and 4 more
