TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation

Laurent Condat; Ivan Agarský; Grigory Malinovsky; Peter Richtárik

TL;DR

This work tackles the communication bottleneck in distributed optimization in federated settings with partial participation. It introduces TAMUNA, the first algorithm to jointly leverage local training (LT) and communication compression (CC) while supporting partial participation, built on a variance-reduced, control-variate framework with permutation-based compression. In the strongly convex regime, TAMUNA converges linearly to the exact solution and exhibits doubly accelerated convergence with respect to the condition number $\kappa$ and the model dimension $d$, yielding improved total communication complexity. Logistic regression experiments on large-scale, heterogeneous data show that TAMUNA needs fewer communication rounds and less transmitted data to reach a given accuracy than prior LT and CC baselines. Together, these results offer a principled, scalable path to efficient federated optimization in the presence of device dropouts and asymmetric networks.

Abstract

In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server. Communication is usually slow and costly and forms the main bottleneck. This is particularly true in federated learning, where a large number of users collaborate toward a global training task. In addition, it is desirable for a robust algorithm to allow for partial participation, since it is often the case that some clients are not able to participate in the entire process and are idle at certain times. Two strategies are popular to reduce the communication burden: 1) local training, which consists of communicating less frequently, or equivalently performing more local computations between the communication rounds; and 2) compression, whereby compressed information instead of full-dimensional vectors is communicated. We propose TAMUNA, the first algorithm for distributed optimization that leverages the two strategies of local training and compression jointly and allows for partial participation. In the strongly convex setting, TAMUNA converges linearly to the exact solution and provably benefits from the two mechanisms: it exhibits a doubly accelerated convergence rate, with respect to the condition number of the functions and the model dimension.

Paper Structure (19 sections, 6 theorems, 106 equations, 3 figures, 3 tables, 1 algorithm)

Introduction
Formalism
A model of Asymmetric Communication
Related Work
Local Training (LT)
Partial Participation (PP)
Communication Compression (CC)
Challenges and Contributions
Combining LT and PP
Combining LT and CC
Proposed Algorithm TAMUNA
Iteration and Communication Complexities
Experiments
Conclusion
Proof of Theorem 1
...and 4 more sections

Key Result

Theorem 1

Let $p\in (0,1]$. In TAMUNA, suppose that at every round $r\geq 0$, $L^{(r)}$ is chosen randomly and independently according to a geometric law of mean $p^{-1}$; that is, for every $L\geq 1$, $\mathrm{Prob}(L^{(r)}=L)=(1-p)^{L-1}p$. Also, suppose that and $\eta \coloneqq p\chi$, where For every total number $t\geq 0$ of local steps made so far, define the Lyapunov function where $x^\star$
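The geometric law for the number of local steps $L^{(r)}$ stated in Theorem 1 can be illustrated with a minimal Python sketch. It draws $L^{(r)}$ by repeating independent trials with success probability $p$ until the first success, which gives $\mathrm{Prob}(L^{(r)}=L)=(1-p)^{L-1}p$ and mean $p^{-1}$. The function name `sample_num_local_steps` and the use of Python's `random` module are assumptions made for illustration only; they are not taken from the paper.

```python
import random


def sample_num_local_steps(p, rng):
    """Draw L >= 1 with Prob(L = l) = (1 - p)**(l - 1) * p.

    Hypothetical helper (not from the paper): counts independent trials,
    each succeeding with probability p, up to and including the first success.
    """
    if not (0.0 < p <= 1.0):
        raise ValueError("p must lie in (0, 1]")
    num_steps = 1
    # rng.random() is uniform on [0, 1), so a trial succeeds with probability p.
    while rng.random() >= p:
        num_steps += 1
    return num_steps


if __name__ == "__main__":
    rng = random.Random(0)
    p = 0.25
    draws = [sample_num_local_steps(p, rng) for _ in range(20000)]
    # The empirical mean should be close to 1 / p.
    print("empirical mean:", sum(draws) / len(draws), "target:", 1 / p)
```

With $p=1$ the sketch always returns one local step per round, so smaller values of $p$ correspond to more local training between communication rounds on average.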

Figures (3)

Figure 1: The random sampling pattern $\mathbf{q}^{(r)}=(q_i^{(r)})_{i=1}^c \in \mathbb{R}^{d \times c}$ used for communication is generated by a random permutation of the columns of a fixed binary template pattern, which has the prescribed number $s\geq 2$ of ones in every row. In (a) with $(d,c,s)=(5,6,2)$ and (b) with $(d,c,s)=(5,7,2)$, with ones in blue and zeros in white, examples of the template pattern used when $d\geq \frac{c}{s}$: for every row $k\in [d]$, there are $s$ ones at columns $i=\mathrm{mod}(s(k-1),c)+1,\ldots,\mathrm{mod}(sk-1,c)+1$. Thus, there are $\lfloor \frac{sd}{c} \rfloor$ or $\lceil \frac{sd}{c} \rceil$ ones in every column vector $q_i$. In (c), an example of a sampling pattern obtained after a permutation of the columns of the template pattern in (a). In (d) with $(d,c,s)=(3,10,2)$, an example of the template pattern used when $\frac{c}{s}\geq d$: for every column $i=1,\ldots,ds$, there is a single one, at row $k=\mathrm{mod}(i-1,d)+1$. Thus, every column vector $q_i$ contains either no one or a single one. Note that when $d= \frac{c}{s}$, the two rules for $d\geq \frac{c}{s}$ and $\frac{c}{s}\geq d$ for constructing the template pattern are equivalent, since they give exactly the same set of sampling patterns when permuting their columns. These two rules make it possible to easily generate the columns $q_i^{(r)}$ of $\mathbf{q}^{(r)}$ on the fly, without having to generate the whole mask $\mathbf{q}^{(r)}$ explicitly. This compression mechanism is the same as in CompressedScaffnew and this figure is the same as Figure 1 in our previous paper (con22cs).
Figure 2: Logistic regression experiment in the case $n>d$. The dataset w8a has $d=300$ features and $n=1000$, so $n \approx 3d$. The first row shows a comparison in the full participation regime, while the second row shows a comparison in the partial participation regime with 10% of clients. On the left, $\alpha=0$, while on the right, $\alpha=0.1$.
Figure 3: Logistic regression experiment in the case $d>n$. The dataset real-sim has $d=20,958$ features and $n=1000$, so $n \approx d/20$. The first row shows a comparison in the full participation regime, while the second row shows a comparison in the partial participation regime with 10% of clients. On the left, $\alpha=0$, while on the right, $\alpha=0.1$.
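The two template rules described in the caption of Figure 1 can be sketched in a few lines of Python. This is an illustrative reading of the caption, not the authors' code: the function names `template_pattern` and `sample_pattern`, the 0-based indexing, and the requirement $s \leq c$ are assumptions. For simplicity the sketch builds the whole mask explicitly, whereas the caption notes that columns can be generated on the fly.

```python
import random


def template_pattern(d, c, s):
    """Binary d x c template with exactly s ones in every row.

    Hypothetical helper following the caption of Figure 1, with 0-based
    indices. Assumes 2 <= s <= c so that the s ones of a row are distinct.
    """
    if not (2 <= s <= c):
        raise ValueError("need 2 <= s <= c")
    q = [[0] * c for _ in range(d)]
    if d * s >= c:
        # Rule for d >= c/s: row k has ones at columns (s*k + j) mod c.
        for k in range(d):
            for j in range(s):
                q[k][(s * k + j) % c] = 1
    else:
        # Rule for c/s >= d: column i < d*s has a single one at row i mod d;
        # the remaining columns stay all-zero.
        for i in range(d * s):
            q[i % d][i] = 1
    return q


def sample_pattern(d, c, s, rng):
    """Random sampling pattern: the template with its columns permuted."""
    template = template_pattern(d, c, s)
    perm = list(range(c))
    rng.shuffle(perm)
    return [[row[perm[i]] for i in range(c)] for row in template]


if __name__ == "__main__":
    rng = random.Random(0)
    for row in sample_pattern(5, 6, 2, rng):
        print(row)
```

Permuting columns leaves the number of ones per row unchanged, so every sampled pattern keeps exactly $s$ ones in each row, as the caption requires.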

Theorems & Definitions (9)

Theorem 1: fast linear convergence to a $\sigma^2$-neighborhood
Remark 2: setting $\eta$
Theorem 3: doubly accelerated communication
Corollary 4: dependence on $\alpha$
Corollary 5: full participation
Theorem 6: fast linear convergence
proof
Theorem 7: sublinear convergence
proof
