TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation

Laurent Condat, Ivan Agarský, Grigory Malinovsky, Peter Richtárik

TL;DR

This work tackles the communication bottleneck in distributed optimization under federated-like settings with partial participation. It introduces TAMUNA, the first algorithm to jointly leverage local training and compression while supporting partial participation, built on a variance-reduced, control-variate framework with permutation-based compression. In the strongly convex regime, TAMUNA achieves linear convergence to the exact solution and exhibits doubly accelerated convergence with respect to the condition number $\kappa$ and the model dimension $d$, yielding improved total communication complexity. Empirical results on logistic regression with large-scale, heterogeneous data validate that TAMUNA reduces the number of communication rounds and the amount of data transmitted to reach a given accuracy, outperforming prior local-training (LT) and communication-compression (CC) baselines. These theoretical and practical advances offer a principled, scalable path for efficient federated optimization in the presence of device dropouts and asymmetric networks.

Abstract

In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server. Communication is usually slow and costly and forms the main bottleneck. This is particularly true in federated learning, where a large number of users collaborate toward a global training task. In addition, it is desirable for a robust algorithm to allow for partial participation, since it is often the case that some clients are not able to participate in the entire process and are idle at certain times. Two strategies are popular to reduce the communication burden: 1) local training, which consists of communicating less frequently, or equivalently performing more local computations between the communication rounds; and 2) compression, whereby compressed information instead of full-dimensional vectors is communicated. We propose TAMUNA, the first algorithm for distributed optimization that leverages the two strategies of local training and compression jointly and allows for partial participation. In the strongly convex setting, TAMUNA converges linearly to the exact solution and provably benefits from the two mechanisms: it exhibits a doubly accelerated convergence rate, with respect to the condition number of the functions and the model dimension.

Paper Structure

This paper contains 19 sections, 6 theorems, 106 equations, 3 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $p\in (0,1]$. In TAMUNA, suppose that at every round $r\geq 0$, $L^{(r)}$ is chosen randomly and independently according to a geometric law of mean $p^{-1}$; that is, for every $L\geq 1$, $\mathrm{Prob}(L^{(r)}=L)=(1-p)^{L-1}p$. Also, suppose that […] and $\eta \coloneqq p\chi$, where […]. For every total number $t\geq 0$ of local steps made so far, define the Lyapunov function […], where $x^\star$ […]
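
To make the geometric law in Theorem 1 concrete, the following minimal sketch draws the number of local steps $L^{(r)}$ for one round and evaluates its probability mass function. It only illustrates the stated distribution and is not the authors' implementation; the function names `geometric_pmf` and `sample_local_steps`, the seed, and the value $p=0.25$ are assumptions chosen for illustration.

```python
import random


def geometric_pmf(L, p):
    """Prob(L_r = L) = (1 - p)**(L - 1) * p for integers L >= 1."""
    if L < 1:
        return 0.0
    return (1.0 - p) ** (L - 1) * p


def sample_local_steps(p, rng):
    """Draw the number of local steps for one round (hypothetical helper).

    Counts independent Bernoulli(p) trials up to and including the first
    success, which yields the geometric law above with support 1, 2, ...
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    steps = 1
    while rng.random() >= p:  # failure happens with probability 1 - p
        steps += 1
    return steps


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily
    p = 0.25                # illustrative value, not taken from the paper
    draws = [sample_local_steps(p, rng) for _ in range(20000)]
    print("empirical mean:", sum(draws) / len(draws))
    print("theoretical mean 1/p:", 1 / p)
```

With $p=1$ the sampler always returns a single local step, and smaller values of $p$ give longer runs of local steps on average, in line with the mean $p^{-1}$.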

Figures (3)

  • Figure 1: The random sampling pattern $\mathbf{q}^{(r)}=(q_i^{(r)})_{i=1}^c \in \mathbb{R}^{d \times c}$ used for communication is generated by a random permutation of the columns of a fixed binary template pattern, which has the prescribed number $s\geq 2$ of ones in every row. Panels (a), with $(d,c,s)=(5,6,2)$, and (b), with $(d,c,s)=(5,7,2)$, show examples of the template pattern used when $d\geq \frac{c}{s}$, with ones in blue and zeros in white: for every row $k\in [d]$, there are $s$ ones at columns $i=\mathrm{mod}(s(k-1),c)+1,\ldots,\mathrm{mod}(sk-1,c)+1$. Thus, there are $\lfloor \frac{sd}{c} \rfloor$ or $\lceil \frac{sd}{c} \rceil$ ones in every column vector $q_i$. Panel (c) shows an example of a sampling pattern obtained after a permutation of the columns of the template pattern in (a). Panel (d), with $(d,c,s)=(3,10,2)$, shows an example of the template pattern used when $\frac{c}{s}\geq d$: for every column $i=1,\ldots,ds$, there is a single one, at row $k=\mathrm{mod}(i-1,d)+1$. Thus, every column vector $q_i$ contains at most one nonzero entry. Note that when $d= \frac{c}{s}$, the two rules for $d\geq \frac{c}{s}$ and $\frac{c}{s}\geq d$ are equivalent, since they give exactly the same set of sampling patterns when their columns are permuted. These two rules make it easy to generate the columns $q_i^{(r)}$ of $\mathbf{q}^{(r)}$ on the fly, without having to generate the whole mask $\mathbf{q}^{(r)}$ explicitly. This compression mechanism is the same as in CompressedScaffnew, and this figure is the same as Figure 1 in our previous paper [con22cs].
  • Figure 2: Logistic regression experiment in the case $n>d$. The dataset w8a has $d=300$ features and $n=1000$, so $n \approx 3d$. The first row shows a comparison in the full participation regime, while the second row shows a comparison in the partial participation regime with 10% of clients. On the left, $\alpha=0$, while on the right, $\alpha=0.1$.
  • Figure 3: Logistic regression experiment in the case $d>n$. The dataset real-sim has $d=20{,}958$ features and $n=1000$, so $n \approx d/20$. The first row shows a comparison in the full participation regime, while the second row shows a comparison in the partial participation regime with 10% of clients. On the left, $\alpha=0$, while on the right, $\alpha=0.1$.
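
The two template rules described in the caption of Figure 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' code: the function names `template_pattern` and `sampling_pattern` are hypothetical, indices are 0-based (so row $k$ and column $i$ of the caption become `k - 1` and `i - 1`), and the check $s \leq c$ is an assumption added so that the $s$ columns selected in a row are distinct.

```python
import random


def template_pattern(d, c, s):
    """Binary d x c template with exactly s ones in every row (0-indexed)."""
    if not 2 <= s <= c:  # s <= c is an assumption made for this sketch
        raise ValueError("expected 2 <= s <= c")
    q = [[0] * c for _ in range(d)]
    if s * d >= c:
        # Rule for d >= c/s: row k has ones at columns s*k, ..., s*k + s - 1,
        # taken modulo c.
        for k in range(d):
            for j in range(s):
                q[k][(s * k + j) % c] = 1
    else:
        # Rule for c/s >= d: each column i < d*s has a single one at row
        # i mod d; the remaining columns are entirely zero.
        for i in range(d * s):
            q[i % d][i] = 1
    return q


def sampling_pattern(d, c, s, rng):
    """Apply a uniformly random permutation to the template's columns."""
    template = template_pattern(d, c, s)
    perm = list(range(c))
    rng.shuffle(perm)
    return [[row[perm[i]] for i in range(c)] for row in template]


if __name__ == "__main__":
    # Parameters of panel (a) in Figure 1; the seed is arbitrary.
    for row in sampling_pattern(5, 6, 2, random.Random(0)):
        print(row)
```

Permuting columns leaves the row sums unchanged, so every row of the sampled mask still contains exactly $s$ ones. This sketch builds the whole mask for readability, whereas the caption notes that individual columns can be generated on the fly.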

Theorems & Definitions (9)

  • Theorem 1: fast linear convergence to a $\sigma^2$-neighborhood
  • Remark 2: setting $\eta$
  • Theorem 3: doubly accelerated communication
  • Corollary 4: dependence on $\alpha$
  • Corollary 5: full participation
  • Theorem 6: fast linear convergence
  • proof
  • Theorem 7: sublinear convergence
  • proof