Pareto-optimal Trade-offs Between Communication and Computation with Flexible Gradient Tracking

Yan Huang, Jinming Xu, Li Chai, Jiming Chen, Karl H. Johansson

TL;DR

The paper tackles distributed stochastic optimization with non-i.i.d. data by introducing FlexGT, a flexible gradient-tracking method with tunable local updates $\beta$ and communications $\alpha$ per round, together with an accelerated variant, Acc-FlexGT, that leverages prior knowledge of the graph. A unified convergence framework is developed for strongly convex, convex, and nonconvex objectives, yielding explicit dependencies on $L$, $\mu$, $\sigma$, $n$, $\rho_W$, $\alpha$, and $\beta$, and demonstrating linear or sublinear rates with controllable consensus and gradient-tracking errors. Acc-FlexGT achieves Pareto-optimal trade-offs between communication and computation: in the nonconvex case its iteration complexity is $\tilde{\mathcal O}\left( \dfrac{L\sigma^2}{n\epsilon^2}+\dfrac{L}{\epsilon\sqrt{1-\sqrt{\rho_W}}} \right)$ and its communication complexity is $\tilde{\mathcal O}\left( \dfrac{L}{\epsilon\sqrt{1-\sqrt{\rho_W}}} \right)$, matching lower bounds up to logarithmic factors, and it improves existing strongly convex results by a factor of $\tilde{\mathcal O}(1/\sqrt{\epsilon})$. The framework unifies and extends prior gradient-tracking methods, offering practical guidance for balancing communication and computation in heterogeneous networks, with empirical validation on synthetic data and MNIST supporting the theoretical gains.
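
To get a feel for how the two terms in the nonconvex bounds compare, the expressions can be evaluated directly. The sketch below drops all constants and logarithmic factors, so the outputs are only relative magnitudes, not iteration counts. The function name and the input values are hypothetical and chosen to keep the arithmetic simple.

```python
import math

def nonconvex_complexities(L, sigma, n, eps, rho_w):
    """Evaluate the quoted nonconvex bounds, ignoring constants and log factors."""
    # Graph-dependent term: L / (eps * sqrt(1 - sqrt(rho_W)))
    graph_term = L / (eps * math.sqrt(1.0 - math.sqrt(rho_w)))
    # Noise-dependent term: L * sigma^2 / (n * eps^2)
    noise_term = L * sigma**2 / (n * eps**2)
    iterations = noise_term + graph_term
    communications = graph_term
    return iterations, communications

# Hypothetical inputs, not taken from the paper's experiments.
iters, comms = nonconvex_complexities(L=1.0, sigma=1.0, n=10, eps=0.1, rho_w=0.5625)
print(iters, comms)
```

With these inputs the noise term is about 10 and the graph term is about 20. Only the noise term shrinks as the number of nodes $n$ grows, and only the iteration bound contains it. The communication bound consists of the graph term alone.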

Abstract

This paper addresses distributed stochastic optimization problems under non-i.i.d. data, focusing on the inherent trade-offs between communication and computational efficiency. To this end, we propose FlexGT, a flexible snapshot gradient tracking method that enables tunable numbers of local updates and neighbor communications per round, thereby adapting efficiently to diverse system resource conditions. Leveraging a unified convergence analysis framework, we derive tight communication and computational complexity for FlexGT with explicit dependence on objective properties and certain tunable parameters. Moreover, we introduce an accelerated variant, termed Acc-FlexGT, and prove that, with prior knowledge of the graph, it achieves Pareto-optimal trade-offs between communication and computation. In particular, in the nonconvex case, Acc-FlexGT achieves the optimal iteration complexity of $\tilde{\mathcal{O}}\left( \left( L\sigma^2 \right) /\left( n\epsilon^2 \right) +L/\left( \epsilon\sqrt{1-\sqrt{\rho_W}} \right) \right)$ and the optimal communication complexity of $\tilde{\mathcal{O}}\left( L/\left( \epsilon\sqrt{1-\sqrt{\rho_W}} \right) \right)$ for appropriately chosen numbers of local updates, matching existing lower bounds up to logarithmic factors. It also improves the existing results for the strongly convex case by a factor of $\tilde{\mathcal{O}} \left( 1/\sqrt{\epsilon} \right)$, where $\epsilon$ is the targeted accuracy, $n$ the number of nodes, $L$ the Lipschitz constant, $\rho_W$ the connectivity of the graph, and $\sigma$ the stochastic gradient variance. Numerical experiments corroborate the theoretical results and demonstrate the effectiveness of the proposed methods.
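
For readers unfamiliar with gradient tracking, the following is a minimal sketch of the standard (non-flexible) scheme with several gossip rounds per iteration. It is not the FlexGT or Acc-FlexGT update, which this summary does not spell out. The ring topology, the quadratic local objectives $f_i(x) = \tfrac{1}{2}(x - b_i)^2$, the exact (noise-free) gradients, and all parameter names are assumptions made for illustration. The differing $b_i$ stand in for heterogeneous (non-i.i.d.) local data.

```python
import numpy as np

def ring_weights(n):
    """Symmetric doubly stochastic weights on a ring: 1/3 for self and each neighbor."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

def gradient_tracking(b, W, gossip_rounds=1, step=0.1, iters=300):
    """Standard gradient tracking on f_i(x) = 0.5 * (x - b_i)^2 (illustrative only)."""
    grad = lambda x: x - b                       # stacked local gradients
    W_k = np.linalg.matrix_power(W, gossip_rounds)
    x = np.zeros_like(b)                         # one scalar iterate per node
    y = grad(x)                                  # tracker starts at the local gradients
    for _ in range(iters):
        x_new = W_k @ (x - step * y)             # descend along tracker, then mix
        y = W_k @ (y + grad(x_new) - grad(x))    # update the gradient tracker
        x = x_new
    return x

b = np.array([1.0, 2.0, 3.0, 4.0, 10.0])         # hypothetical heterogeneous local data
x_final = gradient_tracking(b, ring_weights(5), gossip_rounds=2)
print(x_final, b.mean())
```

Every node ends up at the minimizer of the average objective, `b.mean()`, even though each local minimizer $b_i$ is different. Gradient tracking is designed to handle exactly this kind of heterogeneity.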

Paper Structure

This paper contains 20 sections, 11 theorems, 83 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Suppose Assumption Ass_graph holds and denote $\bar{\rho}_W:=\left\| \bar{W}-\mathbf{J} \right\| ^2$. Then, for FlexGT, we have and for Acc-FlexGT,
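
The quantity $\bar{\rho}_W=\left\| \bar{W}-\mathbf{J} \right\| ^2$ can be computed numerically for a given mixing matrix. The sketch below assumes that $\mathbf{J}$ is the averaging matrix $\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$, that the norm is the spectral norm, and that $\bar{W}$ is a power $W^k$ of a symmetric doubly stochastic $W$ (a hypothetical $k$ rounds of mixing). None of these choices is stated in this summary, and the code does not reproduce the bounds of Lemma 1.

```python
import numpy as np

def consensus_factor(W_bar):
    """Return ||W_bar - J||^2 in the spectral norm, with J = (1/n) * ones * ones^T."""
    n = W_bar.shape[0]
    J = np.full((n, n), 1.0 / n)
    return np.linalg.norm(W_bar - J, ord=2) ** 2

# Hypothetical symmetric ring with weight 1/3 on self and both neighbors.
n = 8
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

rho = consensus_factor(W)
for k in (1, 2, 4):
    rho_bar = consensus_factor(np.linalg.matrix_power(W, k))
    print(k, rho_bar, rho ** k)   # the last two columns agree for symmetric W
```

For a symmetric doubly stochastic $W$, repeated mixing contracts the consensus factor geometrically, which is why extra communication rounds per iteration tighten consensus.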

Figures (6)

  • Figure 1: Flow diagram of the computation and communication of FlexGT. The variable $\psi_{i,l}$, $i=1,\dots,n$, represents the collection of $x_{i,l}$, $y_{i,l}$, and $z_{i,l}$. Only $x_{i,l}$ and $y_{i,l}$ are communicated. The communication is weighted by the matrix $W$.
  • Figure 2: The number of communication and computation steps needed for FlexGT to achieve an accuracy of $\epsilon = 10^{-4}$ with $\alpha, \beta =1,2,\dots,100$. Pareto-optimal solutions (red lines) are achieved with $\alpha = 4$ and $\alpha = 32$ for strongly convex (left) and nonconvex (right) cases, respectively.
  • Figure 3: Basic idea of the unified convergence analysis framework. The solid arrows show the dependency of the terms. The dashed arrows indicate decoupling between the two accumulated terms. The dashed box outlines the Lyapunov and the accumulation methods for (strongly) convex and nonconvex objective functions, respectively.
  • Figure 4: Communication and computational complexity of the FlexGT algorithm to achieve $\epsilon=10^{-3}$ accuracy with $\alpha=1,2,\dots,8$ and $\beta=1,2,\dots,8$ on synthetic data. Each node is in an exponential graph of $n=20$ nodes with 5 neighbors.
  • Figure 5: Comparison of convergence performance on synthetic data. Each node is in an exponential graph of $n=20$ nodes with $\left| \mathcal{N} _i \right|$ neighbors.
  • ...and 1 more figure
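
The captions of Figures 4 and 5 mention an exponential graph with $n=20$ nodes. One common construction, sketched below, is the directed static exponential graph in which node $i$ receives from nodes $(i+2^j) \bmod n$. For $n=20$ this gives 5 neighbors per node. Whether the paper uses this exact construction or these uniform weights is an assumption, and the function name is hypothetical.

```python
import numpy as np

def exponential_graph_weights(n):
    """Circulant mixing matrix: node i averages itself and nodes (i + 2^j) mod n."""
    hops = []
    h = 1
    while h < n:                 # hop distances 1, 2, 4, ... below n
        hops.append(h)
        h *= 2
    W = np.zeros((n, n))
    weight = 1.0 / (len(hops) + 1)   # uniform weight over self and neighbors
    for i in range(n):
        W[i, i] = weight
        for hop in hops:
            W[i, (i + hop) % n] = weight
    return W, hops

W, hops = exponential_graph_weights(20)
J = np.full((20, 20), 1.0 / 20)
rho_w = np.linalg.norm(W - J, ord=2) ** 2   # squared spectral norm of W - J
print(hops, rho_w)
```

Because the matrix is circulant with equal weights, it is doubly stochastic even though the graph is directed.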

Theorems & Definitions (25)

  • Definition 1: Pareto-optimality [boyd2004convex, chen2013distributed]
  • Lemma 1
  • Theorem 1
  • proof
  • Remark 1
  • Corollary 1
  • proof
  • Remark 2
  • Corollary 2
  • proof
  • ...and 15 more
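
Pareto-optimality (Definition 1 in the list above) can be illustrated with the standard notion for two costs to be minimized. A configuration is Pareto-optimal if no other configuration is at least as good in both communication and computation and strictly better in one. The sketch below filters such points from a list. The candidate pairs are made-up numbers, not results from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other (smaller is better in both costs)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical (communication steps, computation steps) pairs, one per parameter setting.
candidates = [(100, 900), (200, 500), (250, 520), (400, 300), (800, 300), (900, 250)]
print(pareto_front(candidates))
```

Here `(250, 520)` is dominated by `(200, 500)` and `(800, 300)` by `(400, 300)`. The remaining four points form the trade-off curve along which one cost can only be reduced by increasing the other.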