A Two-timescale Primal-dual Algorithm for Decentralized Optimization with Compression

Haoming Liu, Chung-Yiu Yau, Hoi-To Wai

TL;DR

This work addresses decentralized optimization under communication constraints by introducing TiCoPD, a two-timescale primal-dual algorithm that supports nonlinear compression through a majorization-minimization (MM) surrogate. By decoupling communication from optimization via a compressed surrogate $\hat{\mathbf{X}}^t$ and employing a contractive compressor, TiCoPD converges with a constant step size and reaches an $O(1/T)$ stationary-point rate without assuming bounded gradient heterogeneity. The main contributions are the MM-based surrogate, the two-timescale update, and a convergence guarantee under standard smoothness and compression assumptions, validated on decentralized training of a neural network. Overall, the method reduces communication overhead in distributed learning while broadening the applicability of compression-enabled decentralized optimization.

Abstract

This paper proposes a two-timescale compressed primal-dual (TiCoPD) algorithm for decentralized optimization with improved communication efficiency over prior works on primal-dual decentralized optimization. The algorithm is built upon the primal-dual optimization framework and utilizes a majorization-minimization procedure. The latter naturally suggests that the agents share a compressed difference term during the iteration. Furthermore, the TiCoPD algorithm incorporates a fast-timescale mirror sequence for agent consensus on nonlinearly compressed terms, together with a slow-timescale primal-dual recursion for optimizing the objective function. We show that the TiCoPD algorithm converges with a constant step size and finds an $O(1/T)$ stationary solution after $T$ iterations. Numerical experiments on decentralized training of a neural network validate the efficacy of the TiCoPD algorithm.
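To make the idea of sharing a compressed difference term concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a top-$k$ sparsifier as the contractive compressor (the paper's experiments use quantization instead) and a surrogate update of the form $\hat{x} \leftarrow \hat{x} + Q(x - \hat{x})$ with the local iterate $x$ held fixed. The names `top_k`, `x`, `x_hat`, `d`, and `k` are hypothetical and chosen for illustration only.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]  # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(0)
d, k = 100, 10                  # dimension and sparsity level (assumed values)
x = rng.standard_normal(d)      # local iterate, held fixed for this illustration
x_hat = np.zeros(d)             # surrogate that the neighbors would track

errors = []
for _ in range(20):
    errors.append(np.linalg.norm(x - x_hat) ** 2)
    # Only the compressed difference would be transmitted over the network.
    x_hat = x_hat + top_k(x - x_hat, k)
errors.append(np.linalg.norm(x - x_hat) ** 2)
```

Top-$k$ satisfies the contraction $\|Q(v) - v\|^2 \le (1 - k/d)\,\|v\|^2$, so each entry of `errors` is at most $(1 - k/d)$ times the previous one. In the actual algorithm the iterate changes on the slow timescale while the surrogate tracks it on the fast timescale.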

Paper Structure

This paper contains 6 sections, 1 theorem, 19 equations, 1 figure, and 1 algorithm.

Key Result

Theorem 4.4

Under Assumptions assm:lip--assm:compress, suppose the step sizes satisfy $\eta > 0$, $\theta \ge \theta_{lb}$, $\alpha \le \alpha_{ub}$, where $\delta_2 = \max\{\frac{16 \eta}{\delta}, 1\}$, $\delta_1 = 12 \max\{2, 2\tilde{\rho}_2^{-1} \eta^{-1}, \delta_2 \tilde{\delta}\}$, and $\tilde{\delta} = \max\{\frac{(1-\delta)^2 (1-\frac{\delta}{2})^2}{(1-\frac{\delta}{2})^2 - (1-\delta)^2}, 1\}$. Then, for any $T$ ...
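As a small worked aid, the three constants that are fully stated above can be evaluated numerically. This sketch only transcribes the formulas for $\tilde{\delta}$, $\delta_2$, and $\delta_1$. It assumes the compression parameter satisfies $0 < \delta \le 1$ and treats $\tilde{\rho}_2$, which is not defined in this excerpt, as a positive input. The function name `theorem_constants` is hypothetical, and the bounds $\theta_{lb}$ and $\alpha_{ub}$ are not reproduced because the excerpt does not state them.

```python
def theorem_constants(delta, eta, rho2_tilde):
    """Evaluate (delta_tilde, delta_2, delta_1) from the stated formulas.

    delta      : compression parameter, assumed to lie in (0, 1]
    eta        : step size, eta > 0
    rho2_tilde : positive constant, not defined in this excerpt
    """
    if not 0 < delta <= 1:
        raise ValueError("delta is assumed to lie in (0, 1]")
    if eta <= 0 or rho2_tilde <= 0:
        raise ValueError("eta and rho2_tilde must be positive")
    a = (1 - delta) ** 2
    b = (1 - delta / 2) ** 2          # b - a = delta * (1 - 3 * delta / 4) > 0
    delta_tilde = max(a * b / (b - a), 1.0)
    delta_2 = max(16 * eta / delta, 1.0)
    delta_1 = 12 * max(2.0, 2.0 / (rho2_tilde * eta), delta_2 * delta_tilde)
    return delta_tilde, delta_2, delta_1

# Example values below are arbitrary and chosen only for illustration.
delta_tilde, delta_2, delta_1 = theorem_constants(delta=0.1, eta=0.1, rho2_tilde=1.0)
```

Since $(1-\frac{\delta}{2})^2 - (1-\delta)^2 = \delta(1 - \frac{3\delta}{4})$, the constant $\tilde{\delta}$ grows roughly like $1/\delta$ as the compressor becomes more aggressive ($\delta \to 0$).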

Figures (1)

  • Figure 1: Training a 2-layer feedforward network using the MNIST data. The bit-rates for communication quantization are displayed in the legend.

Theorems & Definitions (2)

  • Remark 3.1
  • Theorem 4.4