Provably Doubly Accelerated Federated Learning: The First Theoretically Successful Combination of Local Training and Communication Compression

Laurent Condat, Ivan Agarský, Peter Richtárik

TL;DR

This work addresses the communication bottleneck in federated learning (FL) by proposing CompressedScaffnew, a novel algorithm that jointly leverages local training (LT) and communication compression (CC). It provides a theoretical framework showing linear convergence to the exact solution in the strongly convex setting with a doubly accelerated rate, and derives iteration and total communication complexities that improve on prior LT or CC methods. The approach introduces two randomization mechanisms and a tailored compressor design to merge LT and CC effectively, and is supported by sublinear ergodic convergence results in the convex case and by logistic regression experiments. The results have practical impact by enabling faster, communication-efficient FL under asymmetric uplink/downlink costs and diverse model dimensions. Future work may extend to stochastic gradients, partial participation, biased quantization, and nonconvex regimes.

Abstract

In federated learning, a large number of users are involved in a global learning task, in a collaborative way. They alternate local computations and two-way communication with a distant orchestrating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce the communication load and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose the first algorithm for distributed optimization and federated learning, which harnesses these two strategies jointly and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: our algorithm benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.

Paper Structure

This paper contains 17 sections, 4 theorems, 92 equations, 3 figures, 1 table, and 1 algorithm.

Key Result

Theorem 3.1

In CompressedScaffnew, suppose that […]. For every $t\geq 0$, define the Lyapunov function […], where $x^\star$ is the unique solution to the optimization problem and $h_i^\star = \nabla f_i(x^\star)$. Then CompressedScaffnew converges linearly: for every $t\geq 0$, […], where […]. Also, for every $i\in [n]$, $(x_i^t)_{t\in\mathbb{N}}$ and $(\hat{x}_i^t)_{t\in\mathbb{N}}$ both converge to $x^\star$ and $(h_i^t)_{t\in\mat…

Figures (3)

  • Figure 1: The random sampling pattern $\mathbf{q}^t=(q_i^t)_{i=1}^n \in \mathbb{R}^{d \times n}$ used for communication is generated by a random permutation of the columns of a fixed binary template pattern, which has the prescribed number $s\geq 2$ of ones in every row. Panels (a), with $(d,n,s)=(5,6,2)$, and (b), with $(d,n,s)=(5,7,2)$, show examples of the template pattern used when $d\geq \frac{n}{s}$, with ones in blue and zeros in white: for every row $k\in [d]$, there are $s$ ones at columns $i=\mathrm{mod}(s(k-1),n)+1,\ldots,\mathrm{mod}(sk-1,n)+1$. Thus, every column vector $q_i$ contains $\lfloor \frac{sd}{n} \rfloor$ or $\lceil \frac{sd}{n} \rceil$ ones. Panel (c) shows a sampling pattern obtained by permuting the columns of the template pattern in (a). Panel (d), with $(d,n,s)=(3,10,2)$, shows an example of the template pattern used when $\frac{n}{s}\geq d$: every column $i=1,\ldots,ds$ has a single one, at row $k=\mathrm{mod}(i-1,d)+1$. Thus, every column vector $q_i$ contains at most a single one. When $d= \frac{n}{s}$, the two rules for constructing the template pattern (for $d\geq \frac{n}{s}$ and for $\frac{n}{s}\geq d$) are equivalent, since they give exactly the same set of sampling patterns once their columns are permuted. These two rules make it easy to generate the columns $q_i^t$ of $\mathbf{q}^t$ on the fly, without generating the whole mask $\mathbf{q}^t$ explicitly.
  • Figure 2: Logistic regression experiment. The datasets real-sim and w8a have $d=20{,}958$ and $d=300$ features, respectively. In (a) and (b), $d\approx 10n$, whereas in (c) and (d), the opposite holds, with $n=10d$.
  • Figure 3: Logistic regression experiment. The setting is the same as in Figure 2, but with $\kappa = 10^6$ instead of $334$.
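
The two template rules described in the caption of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `template_pattern` and `sample_pattern` are hypothetical, indices are 0-based (so row $k$ and column $i$ of the caption become `k-1` and `i-1`), and the whole mask is built explicitly for readability even though the caption notes that the columns can be generated on the fly.

```python
import numpy as np


def template_pattern(d, n, s):
    """Binary d-by-n template with exactly s ones in every row (hypothetical helper)."""
    if not 2 <= s <= n:
        raise ValueError("expected 2 <= s <= n")
    q = np.zeros((d, n), dtype=np.int8)
    if s * d >= n:
        # Case d >= n/s: row k holds s consecutive ones, wrapping around modulo n.
        for k in range(d):
            for j in range(s):
                q[k, (s * k + j) % n] = 1
    else:
        # Case n/s > d: each of the first d*s columns holds a single one,
        # placed cyclically over the rows; the other columns stay zero.
        for i in range(d * s):
            q[i % d, i] = 1
    return q


def sample_pattern(d, n, s, rng):
    """Sampling pattern: a random permutation of the template's columns."""
    return template_pattern(d, n, s)[:, rng.permutation(n)]


rng = np.random.default_rng(0)  # the seed is arbitrary, chosen for reproducibility
q_t = sample_pattern(5, 6, 2, rng)
print(q_t.sum(axis=1))  # every row sum equals s
```

Permuting the columns leaves the row sums unchanged, so every coordinate is still selected by exactly $s$ columns of the permuted pattern.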

Theorems & Definitions (6)

  • Theorem 3.1
  • Remark 3.2
  • Theorem 3.3
  • Theorem 4.1
  • Theorem A.1
  • Proof