Multi-Timescale Primal Dual Hybrid Gradient with Application to Distributed Optimization

Junhui Zhang, Patrick Jaillet

Abstract

We propose two variants of the Primal Dual Hybrid Gradient (PDHG) algorithm for saddle point problems with block decomposable duals, hereafter called Multi-Timescale PDHG (MT-PDHG) and its accelerated variant (AMT-PDHG). Through novel mixtures of Bregman divergence and multi-timescale extrapolations, our MT-PDHG and AMT-PDHG converge under arbitrary updating rates for different dual blocks while remaining fully deterministic and robust to extreme delays in dual updates. We further apply our (A)MT-PDHG, augmented with the gradient sliding techniques introduced in Lan et al. (2020) and Lan (2016), to distributed optimization. The flexibility in choosing different updating rates for different blocks allows more refined control over the communication rounds between different pairs of agents, thereby improving efficiency in settings with heterogeneous local objectives and communication costs. Moreover, with careful choices of penalty levels, our algorithms show linear and thus optimal dependency on function similarities, a measure of how similar the gradients of local objectives are. This provides a positive answer to the open question of whether such dependency is achievable for non-smooth objectives (Arjevani and Shamir 2015).
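For orientation, the following is a minimal sketch of the standard single-timescale PDHG iteration that the proposed variants build on. It is not the paper's MT-PDHG or AMT-PDHG: there is a single dual block, no Bregman mixture, and no gradient sliding. The toy problem (projecting a point $c$ onto $\{x : Ax = b\}$), the function name `pdhg_projection`, and the step sizes are assumptions chosen only for illustration.

```python
def pdhg_projection(A, b, c, tau, sigma, iters=200):
    """Standard PDHG on the saddle problem
        min_x max_y 0.5*||x - c||^2 + <y, A x - b>,
    whose primal solution is the projection of c onto {x : A x = b}.
    A is a list of rows; b, c are lists. Illustrative baseline only."""
    m, n = len(A), len(c)
    x = [0.0] * n
    x_bar = x[:]          # extrapolated primal point
    y = [0.0] * m
    for _ in range(iters):
        # Dual ascent step evaluated at the extrapolated primal point.
        Ax_bar = [sum(A[i][j] * x_bar[j] for j in range(n)) for i in range(m)]
        y = [y[i] + sigma * (Ax_bar[i] - b[i]) for i in range(m)]
        # Primal proximal step: prox of tau*0.5*||. - c||^2 at v is (v + tau*c)/(1 + tau).
        ATy = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
        x_new = [(x[j] - tau * ATy[j] + tau * c[j]) / (1.0 + tau) for j in range(n)]
        # Extrapolation x_bar = 2*x_new - x_old.
        x_bar = [2.0 * x_new[j] - x[j] for j in range(n)]
        x = x_new
    return x, y


# Toy instance: project c = (2, 0) onto the line x1 + x2 = 1.
# Step sizes satisfy tau * sigma * ||A||^2 = 0.5 < 1.
x, y = pdhg_projection([[1.0, 1.0]], [1.0], [2.0, 0.0], tau=0.5, sigma=0.5)
```

In this baseline the single dual variable is refreshed at every iteration; the abstract's multi-timescale setting instead lets each dual block be updated at its own rate.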

Paper Structure

This paper contains 31 sections, 18 theorems, 122 equations, 8 figures, 3 algorithms.

Key Result

Corollary 3.1

Consider the following updates using $GS$, the generalized gradient sliding procedure in Algorithm alg:CS-procedure, where $T_k\in \mathbb{N}$ and $\eta_k = \sum_{s=1}^S \eta_{k,s}>0$ with $\eta_{k,s}\geq 0$. Assume that eq:prop-F holds with some $\mu\geq 0$. Then, with $\lambda_t = t+1$ and $\beta_t =\frac{t}{2}$ for $t\geq 1$, eq:primal-approx-prop holds with ... Further, if eq:prop-F holds with ...

Figures (8)

  • Figure 1: Updates for $S=3$, $r_1=r_2 = 3$ and $r_3 = 6$. Each marker represents one update: $(X^k, \widehat{X}^k)$ is updated at each global time $k=0,1,\ldots,17$. If generalized gradient sliding is used, then this involves $T_k$ iterations of mirror descent updates at iteration $k$. $y_1$ is updated at each local time $i_1 = 0,\ldots,5$, i.e. global time $k = 0,3,6,\ldots,15$, and similarly for $y_2$ and $y_3$.
  • Figure 2: Left: abstract setting with $m$ primal agents and $S$ dual agents. Middle: realization in the decentralized setting, where $S = m=4$, $\mathsf{Agent}(x_s) = \mathsf{Agent}(y_s)$, and the underlying graph is $(V,E=\{\{1,3\},\{1,4\},\{2,3\}\})$. Right: realization in the hierarchical setting.
  • Figure 3: KKT residual as a function of global iteration for different $(m,n)$, under 4 different combinations of the updating rates for the dual blocks: 1. $r_s=1$ for all $s$; 2. $r_1 = r_2 = r_3 = 1$ and $r_4 = r_5 = r_6 = 10$; 3. $r_s=10$ for all $s$; 4. $r_s=50$ for all $s$.
  • Figure 4: KKT residual as a function of running time in seconds for different $(m,n)$, under 4 different combinations of the updating rates for the dual blocks: 1. $r_s=1$ for all $s$; 2. $r_1 = r_2 = r_3 = 1$ and $r_4 = r_5 = r_6 = 10$; 3. $r_s=10$ for all $s$; 4. $r_s=50$ for all $s$.
  • Figure 5: Dependence of $F(\Pi \underline{X}^k)$ on the iteration number $k$ and the mean updating rate $\overline{r}$ for MT-PDHG ($\mu=0$) (left) and AMT-PDHG ($\mu=0.01$) (right), both with communication sliding. Legends represent $(r_1,r_2,r_3,\overline{r})$ and line colors represent $\overline{r}$.
  • ...and 3 more figures
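
The update pattern described in the Figure 1 caption can be sketched as a small schedule computation. This is an illustrative reading of the caption only: it assumes that dual block $y_s$ is updated exactly at the global times $k$ that are multiples of $r_s$, which matches the stated times $k = 0,3,6,\ldots,15$ for $y_1$. The function name `update_schedule` is hypothetical.

```python
def update_schedule(rates, num_global_steps):
    """Return, for each dual block s = 1..S, the global times k in
    [0, num_global_steps) at which y_s is updated, assuming y_s is
    updated whenever k is a multiple of its rate r_s."""
    return {
        s: [k for k in range(num_global_steps) if k % r == 0]
        for s, r in enumerate(rates, start=1)
    }


# Setting of Figure 1: S = 3, r_1 = r_2 = 3, r_3 = 6, global times k = 0, ..., 17.
schedule = update_schedule([3, 3, 6], 18)
for s, times in schedule.items():
    # The local time i_s of each update is its index in the list.
    print(f"y_{s}: global times {times}")
```

The primal pair $(X^k, \widehat{X}^k)$ is updated at every global time, so it has no entry in the schedule.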

Theorems & Definitions (30)

  • Corollary 3.1
  • Theorem 3.1
  • Proof of Theorem \ref{thm:main}
  • Corollary 3.2
  • Proof of Corollary \ref{cor:main-cs}
  • Corollary 3.3
  • Proof of Corollary \ref{cor:main}
  • Theorem 3.2
  • Proof of Theorem \ref{thm:acc_convergence}
  • Corollary 3.4
  • ...and 20 more