Cohort Squeeze: Beyond a Single Communication Round per Cohort in Cross-Device Federated Learning

Kai Yi; Timur Kharisov; Igor Sokolov; Peter Richtárik

TL;DR

This paper tackles the high communication cost of cross-device federated learning by challenging the traditional single-round-per-cohort paradigm. It introduces SPPM-AS, a stochastic proximal point method with arbitrary cohort sampling that allows multiple local proximal updates within each global iteration, reducing total communication cost while preserving convergence guarantees. The authors develop a unified theory covering sampling distributions, proximal updates, and iteration complexity, and instantiate it for nice sampling (NICE), block sampling (BS), and stratified sampling (SS), with stratified sampling often yielding the best variance properties. Experiments on convex logistic regression and non-convex neural networks (including FEMNIST) show up to 74% communication-cost reduction in standard FL and even higher savings in hierarchical FL, supporting both the approach and the practical tuning guidelines. Overall, the work provides a principled path to more communication-efficient cross-device FL by combining flexible sampling with multi-round cohort interactions.
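To make the three sampling schemes named above concrete, here is a minimal Python sketch. It assumes the usual definitions: NICE draws a uniformly random cohort of fixed size, BS draws one whole block of a fixed client partition, and SS draws one client from every block. The function names, the example partition `blocks`, and the block probabilities `probs` are illustrative assumptions, not taken from the paper's code.

```python
import random

def nice_sampling(n, tau, rng):
    """NICE: a uniformly random cohort of exactly tau distinct clients out of n."""
    return sorted(rng.sample(range(n), tau))

def block_sampling(blocks, probs, rng):
    """BS: the cohort is one whole block; block j is drawn with probability probs[j]."""
    j = rng.choices(range(len(blocks)), weights=probs, k=1)[0]
    return sorted(blocks[j])

def stratified_sampling(blocks, rng):
    """SS: one client drawn uniformly at random from every block (stratum)."""
    return [rng.choice(block) for block in blocks]

rng = random.Random(0)
# Hypothetical partition of 10 clients into 3 blocks (e.g. from clustering).
blocks = [[0, 1, 2], [3, 4], [5, 6, 7, 8, 9]]
probs = [0.3, 0.2, 0.5]

cohort_nice = nice_sampling(10, 4, rng)
cohort_bs = block_sampling(blocks, probs, rng)
cohort_ss = stratified_sampling(blocks, rng)
print(cohort_nice, cohort_bs, cohort_ss)
```

Under SS every cohort contains a representative of each block, which is one intuition for why it can have lower variance than BS or NICE when clients within a block are similar.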

Abstract

Virtually all federated learning (FL) methods, including FedAvg, operate in the following manner: i) an orchestrating server sends the current model parameters to a cohort of clients selected via a certain rule, ii) these clients then independently perform a local training procedure (e.g., via SGD or Adam) using their own training data, and iii) the resulting models are shipped to the server for aggregation. This process is repeated until a model of suitable quality is found. A notable feature of these methods is that each cohort is involved in only a single communication round with the server. In this work we challenge this algorithmic design primitive and investigate whether it is possible to "squeeze more juice" out of each cohort than what is possible in a single communication round. Surprisingly, we find that this is indeed the case, and our approach leads to up to 74% reduction in the total communication cost needed to train an FL model in the cross-device setting. Our method is based on a novel variant of the stochastic proximal point method (SPPM-AS), which supports a large collection of client sampling procedures, some of which lead to further gains when compared to classical client selection approaches.
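The idea of keeping one cohort for several communication rounds can be sketched on a toy problem. The Python example below is a hedged illustration, not the authors' implementation. It assumes scalar quadratic client losses $f_i(x) = \tfrac{1}{2} a_i (x - b_i)^2$, NICE sampling of size `tau` (so the cohort loss is the plain average over the cohort), and plain gradient descent as the inner solver for the proximal subproblem $\min_y f_C(y) + \tfrac{1}{2\gamma}(y - x_t)^2$. Each inner step stands for one communication round with the same cohort, so the total cost is $TK$. The function name `sppm_as` and all parameter values are hypothetical.

```python
import random

def sppm_as(a, b, x0, gamma, T, K, tau, seed=0):
    """Toy SPPM-AS loop on scalar quadratics f_i(x) = 0.5 * a[i] * (x - b[i])**2.

    Each of the T global iterations samples one cohort and approximates
    prox_{gamma * f_C}(x_t) with K gradient steps on the subproblem.
    Returns the final iterate and the total communication cost T * K.
    """
    rng = random.Random(seed)
    n = len(a)
    x = x0
    # Step size 1/L, where L = max(a) + 1/gamma upper-bounds the subproblem's curvature.
    step = 1.0 / (max(a) + 1.0 / gamma)
    for _ in range(T):
        cohort = rng.sample(range(n), tau)      # NICE sampling (assumption)
        y = x                                   # warm start at the current model
        for _ in range(K):                      # K rounds with the SAME cohort
            grad_f = sum(a[i] * (y - b[i]) for i in cohort) / tau
            y -= step * (grad_f + (y - x) / gamma)
        x = y
    return x, T * K

# Interpolation regime: every client shares the minimizer 3.0.
a = [1.0 + i / 9 for i in range(10)]
b = [3.0] * 10
x_final, total_cost = sppm_as(a, b, x0=0.0, gamma=1.0, T=30, K=5, tau=3)
print(x_final, total_cost)
```

With `K = 1` the loop degenerates to a single round per cohort, the design the abstract calls into question; larger `K` buys a more accurate proximal step per cohort at the price of more rounds per global iteration, and the quantity to minimize is the product `T * K`.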

Paper Structure (61 sections, 9 theorems, 80 equations, 15 figures, 5 tables, 4 algorithms)

Introduction
Method
Sampling Distribution
Core Algorithm
Theorem interpretation.
Interpolation regime.
A single step travels far.
Iteration complexity.
General Framework.
Arbitrary Sampling Examples
Nice Sampling (NICE).
Block Sampling (BS).
Stratified Sampling (SS).
Stratified Sampling Outperforms Block Sampling and Nice Sampling in Convergence Neighborhood.
Experiments
...and 46 more sections

Key Result

Theorem 4

Let asm:differential (differentiability) and asm:strongly_convex (strong convexity) hold. Let $\mathcal{S}$ be a sampling satisfying asm:valid_sampling, and define […]. Let $x_0 \in \mathbb{R}^d$ be an arbitrary starting point. Then for any $t \geq 0$ and any $\gamma>0$, the iterates of SPPM-AS (alg:sppm_as) satisfy

Figures (15)

Figure 1: The total communication cost (defined as $TK$) with the number of local communication rounds $K$ needed to reach the target accuracy $\epsilon$ for the chosen cohort in each global iteration. The dashed red line depicts the communication cost of the FedAvg algorithm. Markers indicate the $TK$ value for different learning rates $\gamma$ of our algorithm SPPM-AS.
Figure 2: Analysis of total communication costs against local communication rounds for computing the proximal operator. For LocalGD, we align the x-axis to the total local iterations, highlighting the absence of local communication. The aim is to minimize total communication for achieving a predefined global accuracy $\epsilon$, where ${\left\lVert x_T - x_\star\right\rVert}^2<\epsilon$. The optimal step size and minibatch sampling setup for LocalGD are denoted as "LocalGD, optim". This showcases a comparison across varying $\epsilon$ values and proximal operator solvers (CG and BFGS).
Figure 3: The first column compares sampling methods, while the right two columns analyze convergence relative to popular baselines. $\gamma=1.0$.
Figure 4: The left column shows the Server-hub-client hierarchical FL architecture. For the right two columns: on the left, communication cost for achieving 70% accuracy in hierarchical FL ($c_1=0.05$, $c_2=1$); on the right, convergence with optimal hyperparameters ($c_1=0.05$, $c_2=1$).
Figure 5: t-SNE visualization of cluster-features across data samples on clients.
...and 10 more figures

Theorems & Definitions (22)

Theorem 4: Convergence of SPPM-AS
Lemma 5: Variance Reduction Due to Stratified Sampling
Lemma 6
Theorem 7: FedProx-SPPM-AS convergence
proof
proof
Lemma 8
proof
proof
Theorem 12: Main Theorem
...and 12 more