On the Convergence and Stability of Distributed Sub-model Training

Yuyang Deng, Fuli Qiao, Mehrdad Mahdavi

TL;DR

The paper proposes distributed shuffled sub-model training: the full model is partitioned into several sub-models in advance; in each round, the server shuffles those sub-models and sends them to the clients; at the end of the local update period, the clients send back their updated sub-models and the server averages them.

Abstract

As learning models continue to grow in size, enabling on-device local training of these models has emerged as a critical challenge in federated learning. A popular solution is sub-model training, where the server distributes only randomly sampled sub-models to the edge clients, and the clients update only these smaller models. However, such random sampling of sub-models may not yield satisfactory convergence performance. In this paper, motivated by the success of SGD with shuffling, we propose distributed shuffled sub-model training: the full model is partitioned into several sub-models in advance; in each round, the server shuffles those sub-models and sends them to the clients; at the end of the local update period, the clients send back their updated sub-models and the server averages them. We establish the convergence rate of this algorithm. We also study the generalization of distributed sub-model training via stability analysis, and find that sub-model training can improve generalization by amplifying the stability of the training process. Extensive experiments validate our theoretical findings.
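The round structure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the least-squares loss, the function names (`make_partition`, `local_update`, `shuffled_submodel_training`), the rule that client `i` receives block `order[i % num_blocks]`, and the per-coordinate averaging over the clients that held each block are all hypothetical choices made for the example.

```python
import numpy as np


def make_partition(dim, num_blocks, rng):
    """Split coordinates 0..dim-1 into disjoint blocks (sub-models), fixed in advance."""
    return np.array_split(rng.permutation(dim), num_blocks)


def local_update(w_block, A_block, b, lr, steps, batch, rng):
    """Client side: a few SGD steps on a least-squares loss using only the sub-model."""
    w_block = w_block.copy()
    for _ in range(steps):
        idx = rng.choice(len(b), size=batch, replace=False)
        residual = A_block[idx] @ w_block - b[idx]
        w_block -= lr * A_block[idx].T @ residual / batch
    return w_block


def shuffled_submodel_training(clients, dim, num_blocks, rounds,
                               lr=0.05, steps=5, batch=16, seed=0):
    """Server loop: shuffle sub-models, dispatch, collect, and average."""
    rng = np.random.default_rng(seed)
    blocks = make_partition(dim, num_blocks, rng)
    w = np.zeros(dim)
    for _ in range(rounds):
        order = rng.permutation(num_blocks)  # fresh shuffle every round
        sums = np.zeros(dim)
        counts = np.zeros(dim)
        for i, (A, b) in enumerate(clients):
            # Assumed assignment rule: client i gets the (i mod num_blocks)-th shuffled block.
            block = blocks[order[i % num_blocks]]
            sums[block] += local_update(w[block], A[:, block], b, lr, steps, batch, rng)
            counts[block] += 1
        held = counts > 0
        # Assumed aggregation: average each coordinate over the clients that held it.
        w[held] = sums[held] / counts[held]
    return w
```

Here `clients` is a list of `(A, b)` pairs holding each client's local features and targets. Because every client only ever sees the columns of `A` that belong to its assigned block, the per-client computation and communication scale with the block size rather than with the full model dimension.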

Paper Structure

This paper contains 28 sections, 21 theorems, 186 equations, 4 figures, 4 tables.

Key Result

Theorem 1

Let the assumptions from smoothness through bounded gradient hold. Then the Masked FedAvg algorithm with $\eta = \frac{\log (KR)^2}{\tilde{\mu} KR}$ and $R \geq \frac{L}{\tilde{\mu}} \log(K^2R^2)$ will output the solution $\hat{\mathbf{w}}$, such that the following statement holds: where $\bar{\mu} := \frac{1}{N}\sum_{i=1}^N p_i \mu$, $\tilde{\mu} := \min_{i\in[N]} p_i \mu$, $\tilde{L} := \max

Figures (4)

  • Figure 1: Global testing loss/accuracy of rolling and random masking under high data heterogeneity.
  • Figure 2: Global testing loss/accuracy of rolling and random masking under low data heterogeneity.
  • Figure 3: Global testing loss/accuracy of rolling and random masking for the largest and smallest client model capacities, under high data heterogeneity.
  • Figure 4: Global testing loss/accuracy of rolling and random masking for the largest and smallest client model capacities, under low data heterogeneity.

Theorems & Definitions (47)

  • Definition 1
  • Theorem 1
  • Definition 2
  • Theorem 2
  • Definition 3
  • Theorem 3
  • Remark 1
  • Theorem 4
  • Definition 4
  • Lemma 1
  • ...and 37 more