Accelerating Distributed Optimization: A Primal-Dual Perspective on Local Steps
Junchi Yang, Murat Yildirim, Qiu Feng
TL;DR
This paper tackles distributed optimization across multiple agents with heterogeneous data by formulating a primal–dual framework in which local primal updates occur without inter-agent communication and a coordinated dual ascent step enforces global agreement. By applying Gradient Ascent Multiple Stochastic Gradient Descent (GA-MSGD) to the Lagrangian and coupling it with Catalyst acceleration, the authors achieve near-optimal communication complexities across strongly convex, convex, and nonconvex settings without requiring large minibatches, in both centralized and decentralized networks. A key theoretical insight is that the dual function is strongly concave when the coupling matrix has full rank or, in the decentralized case, when the dual variable is restricted to the span of $U=\sqrt{I-W}$. This property yields linear convergence in the outer loop and therefore reduces the number of communication rounds. The framework unifies several existing methods under a minimax perspective and improves on prior rates for LED, Scaffnew/ProxSkip, and stochastic gradient tracking, which makes it attractive for scalable distributed learning and optimization.
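The strong concavity claim can be sketched with a standard duality argument. The notation here is an assumption made for illustration and is not taken from the paper: the local variables are stacked as $X$, $F(X)=\sum_i f_i(x_i)$ is convex and $\ell$-smooth, $I-W$ is symmetric positive semidefinite, and $\lambda_{\min}^{+}$ denotes the smallest nonzero eigenvalue.

$$
\Psi(\Lambda) \;=\; \min_X \big\{ F(X) + \langle \Lambda,\, UX \rangle \big\} \;=\; -F^{*}(-U^{\top}\Lambda)
$$

Because $F$ is $\ell$-smooth, its conjugate $F^{*}$ is $(1/\ell)$-strongly convex. For $\Lambda, \Lambda' \in \mathrm{range}(U)$,

$$
\|U^{\top}(\Lambda-\Lambda')\|^{2} \;=\; (\Lambda-\Lambda')^{\top}(I-W)(\Lambda-\Lambda') \;\ge\; \lambda_{\min}^{+}(I-W)\,\|\Lambda-\Lambda'\|^{2},
$$

so $\Psi$ is strongly concave on $\mathrm{range}(U)$ with modulus $\lambda_{\min}^{+}(I-W)/\ell$. When the minimizer $X^{*}(\Lambda)$ is unique, $\nabla\Psi(\Lambda) = U X^{*}(\Lambda) \in \mathrm{range}(U)$, so gradient ascent started in $\mathrm{range}(U)$ (for example at $\Lambda_0 = 0$) never leaves that subspace.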
Abstract
In distributed machine learning, efficient training across multiple agents with different data distributions poses significant challenges. Even with a centralized coordinator, current algorithms that achieve optimal communication complexity typically require either large minibatches or compromise on gradient complexity. In this work, we tackle both centralized and decentralized settings across strongly convex, convex, and nonconvex objectives. We first demonstrate that a basic primal-dual method, (Accelerated) Gradient Ascent Multiple Stochastic Gradient Descent (GA-MSGD), applied to the Lagrangian of distributed optimization inherently incorporates local updates, because the inner loops of running Stochastic Gradient Descent on the primal variable require no inter-agent communication. Notably, for strongly convex objectives, (Accelerated) GA-MSGD achieves linear convergence in communication rounds despite the Lagrangian being only linear in the dual variables. This is due to a structural property where the dual variable is confined to the span of the coupling matrix, rendering the dual problem strongly concave. When integrated with the Catalyst framework, our approach achieves nearly optimal communication complexity across various settings without the need for minibatches.
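To make the structure described in the abstract concrete, here is a minimal toy sketch of a primal-dual loop with local steps. It is not the authors' implementation: the scalar quadratic losses $f_i(x)=\tfrac{1}{2}a_i(x-b_i)^2$, the averaging-based consensus constraint, the step sizes, and the function name `ga_msgd_toy` are all assumptions chosen for illustration, and neither acceleration nor Catalyst is included.

```python
import random


def ga_msgd_toy(a, b, rounds=60, local_steps=20, noise_std=0.0, seed=0):
    """Toy primal-dual loop (illustrative assumption, not the paper's code).

    Agent i holds f_i(x) = 0.5 * a[i] * (x - b[i])**2 and a dual variable lam[i].
    The consensus constraint is x[i] == mean(x), with Lagrangian
    sum_i f_i(x[i]) + sum_i lam[i] * (x[i] - mean(x)).
    """
    rng = random.Random(seed)
    n = len(a)
    x = [0.0] * n    # local primal variables, one per agent
    lam = [0.0] * n  # dual variables; they stay zero-mean by construction
    primal_lr = 1.0 / max(a)  # 1/L for the local quadratics
    dual_lr = min(a)          # mu; the dual function is (1/mu)-smooth

    for _ in range(rounds):
        # Inner loop: each agent runs (stochastic) gradient steps on the
        # Lagrangian using only its own data and its own lam[i].
        # No communication happens here; agents would run in parallel.
        for i in range(n):
            for _ in range(local_steps):
                noise = noise_std * rng.gauss(0.0, 1.0)
                grad = a[i] * (x[i] - b[i]) + lam[i] + noise
                x[i] -= primal_lr * grad

        # Outer step: one communication round (averaging), followed by
        # gradient ascent on the dual variables.
        x_bar = sum(x) / n
        lam = [lam[i] + dual_lr * (x[i] - x_bar) for i in range(n)]

    return x, lam


if __name__ == "__main__":
    a = [1.0, 1.5, 2.0, 1.2]
    b = [0.0, 1.0, -2.0, 3.0]
    x, lam = ga_msgd_toy(a, b)
    x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
    print("local models:", x)
    print("minimizer of the summed objective:", x_star)
```

With `noise_std=0.0` the local models agree with the minimizer of the summed objective after a modest number of rounds. With `noise_std > 0` and constant step sizes, one would expect the iterates to settle only in a neighborhood of it.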
