SMoFi: Step-wise Momentum Fusion for Split Federated Learning on Heterogeneous Data

Mingkun Yang, Ran Zhu, Qing Wang, Jie Yang

TL;DR

This work tackles the challenge of gradient divergence caused by non-IID data in Split Federated Learning by introducing Step-wise Momentum Fusion (SMoFi). SMoFi synchronizes momentum buffers across parallel server-side optimizers and employs a staleness-aware momentum alignment to maintain consistent training trajectories without altering client-side computation. Theoretical convergence guarantees under partial participation show an $O(1/N)$ rate with explicit bounds, and extensive experiments demonstrate up to 7.1 percentage-point accuracy gains and up to 10.25× faster convergence, especially with more clients and deeper models. Practically, SMoFi provides a lightweight, client-transparent enhancement for real-world split FL deployments in resource-constrained environments.

Abstract

Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. Data heterogeneity across silos, however, presents a major challenge undermining the convergence speed and accuracy of the global model. This paper introduces Step-wise Momentum Fusion (SMoFi), an effective and lightweight framework that counteracts gradient divergence arising from data heterogeneity by synchronizing the momentum buffers across server-side optimizers. To control gradient divergence over the training process, we design a staleness-aware alignment mechanism that imposes constraints on gradient updates of the server-side submodel at each optimization step. Extensive validations on multiple real-world datasets show that SMoFi consistently improves global model accuracy (up to 7.1%) and convergence speed (up to 10.25$\times$). Furthermore, SMoFi has a greater impact with more clients involved and deeper learning models, making it particularly suitable for model training in resource-constrained contexts.
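The momentum synchronization described above can be illustrated with a minimal sketch. It assumes plain SGD with momentum, an unweighted average of the buffers, and that the fused buffer is applied in the same step; the paper's staleness-aware weighting is not specified in this summary and is deliberately left out. The function name `fused_momentum_step` and all parameters are hypothetical.

```python
import numpy as np

def fused_momentum_step(weights, momenta, grads, lr=0.1, beta=0.9):
    """One step for all parallel server-side surrogate models.

    weights, momenta, grads: lists of arrays, one entry per surrogate model.
    Assumption: buffers are fused by an unweighted mean (no staleness factor).
    """
    # Each surrogate optimizer first updates its own momentum buffer.
    local = [beta * m + g for m, g in zip(momenta, grads)]
    # Fusion: every optimizer adopts the average buffer across surrogates.
    fused = np.mean(local, axis=0)
    new_weights = [w - lr * fused for w in weights]
    new_momenta = [fused.copy() for _ in weights]
    return new_weights, new_momenta

# Two surrogates that start from the same weights but see conflicting gradients.
w = [np.zeros(3), np.zeros(3)]
m = [np.zeros(3), np.zeros(3)]
g = [np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 2.0])]
w, m = fused_momentum_step(w, m, g)
```

In this toy case the conflicting first components cancel in the fused buffer, so both surrogates take the same step and remain identical, which is the intuition behind keeping the parallel training trajectories consistent.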

Paper Structure

This paper contains 27 sections, 3 theorems, 36 equations, 11 figures, 6 tables, 2 algorithms.

Key Result

Theorem 3.5

Under Assumptions main_assumption_1, main_assumption_2, main_assumption_3, and main_assumption_4, SMoFi has convergence guarantees similar to those of SFLV1 with momentum SGD as the optimization solver. Given the predefined communication rounds $N$, client participation rate $\theta$, and a sma… The $A$, $B$, $C$, and $\gamma$ in the error bound follow $A = \lvert\mathcal{J}\rvert \sum_{j\in\ldots}$ …

Figures (11)

  • Figure 1: Momentum in local optimizers improves model performance on both moderate and extreme non-IID data in the long run, although it slows down learning.
  • Figure 2: Comparison of server-side model updates $\mathcal{W}_{s}^{(n,0)}\mapsto\mathcal{W}_{s}^{n+1}$ in our SMoFi and in the state-of-the-art SFV1 and SFV2. In SFV1, the server updates surrogate server-side models $\mathcal{W}_{s, j}^{(n,\tau)}$ in parallel and periodically aggregates them (e.g., after each local epoch, as illustrated). In SFV2, the server interacts with clients sequentially to update the server-side model. SMoFi is akin to SFV1 in that the server updates surrogate models in parallel, but it additionally introduces momentum alignment at each step $\tau$ by synchronizing the momentum buffers $\bar{m}$ across the server-side solvers. Such alignment helps the aggregated model converge toward the global optimum $\mathcal{W}_{s}^{*}$, rather than the local optima $\mathcal{W}_{s,1}^{*}$ or $\mathcal{W}_{s,2}^{*}$.
  • Figure 3: Sensitivity study of SMoFi on CIFAR10: (left two) accuracy and convergence under varying staleness factor $\alpha$; (right two) performance under different cut layers $L$. For instance, $L=0$ indicates that all 8 residual blocks and the output block are allocated to the server, while the clients hold only the input block. The dashed line represents the accuracy of FedAvg under the same setting.
  • Figure 4: Learning curves of FedAvg integrated with SMoFi and its counterparts on CIFAR100 using ResNet-18 under the $\mathrm{Dir}_{100}(0.2)$ distribution. From left to right, we investigate different optimizers: SGD with momentum (SGDM), Nesterov Accelerated Gradient (NAG), Adaptive Moment Estimation (Adam), and Adam with decoupled weight decay (AdamW).
  • Figure 5: Robustness analysis on the Tiny-ImageNet dataset under $\mathrm{Dir}_{200}(0.2)$ distribution with various task-specific models: VGG (V), MobileNet (M), ResNet (R), and DenseNet (D). We report the best global model performance (left) and round-to-accuracy performance (right), within 150 communication rounds.
  • ...and 6 more figures
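
The $\mathrm{Dir}(0.2)$ settings in the captions above refer to Dirichlet-based non-IID data splits. The sketch below shows one common label-skew scheme, not necessarily the paper's exact protocol: for each class, client shares are drawn from a Dirichlet distribution with concentration 0.2. The meaning of the subscript in $\mathrm{Dir}_{100}$ is not defined in this summary, so the number of clients here is an arbitrary assumption, and `dirichlet_partition` is a hypothetical helper.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, concentration, seed=0):
    """Split sample indices across clients with Dirichlet label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Fraction of class c assigned to each client.
        shares = rng.dirichlet(np.full(num_clients, concentration))
        # Convert cumulative shares into split points over this class's samples.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy example: 1000 samples, 10 classes, 5 clients (all values illustrative).
toy_labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(toy_labels, num_clients=5, concentration=0.2)
```

A smaller concentration makes each class concentrate on fewer clients, which is the kind of heterogeneity that drives the gradient divergence SMoFi targets.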

Theorems & Definitions (5)

  • Theorem 3.5
  • Proposition D.1
  • proof
  • Lemma D.2
  • proof