MSfusion: A Dynamic Model Splitting Approach for Resource-Constrained Machines to Collaboratively Train Larger Models

Jin Xie, Songze Li

TL;DR

The paper tackles the challenge of training large models on resource-constrained, decentralized devices by proposing MSfusion, a model-splitting framework that combines a double shifting scheme, overlap aggregation, and a contrastive objective to curb drift across heterogeneous data. It demonstrates that training sub-models in a fully decentralized setting can achieve competitive accuracy with substantially reduced computation and communication, and that performance scales favorably as more participants join. The three key innovations (double shifting splitting, adaptive overlap, and a contrastive loss) drive strong results on computer vision (CV) and NLP benchmarks, outperforming state-of-the-art partial-training (PT) and knowledge-distillation (KD) approaches under non-ideal network and data conditions. The work highlights practical potential for private, serverless collaboration and lays groundwork for privacy-preserving extensions such as differential privacy (DP) and secure aggregation.

Abstract

Training large models requires a large amount of data, as well as abundant computation resources. While collaborative learning (e.g., federated learning) provides a promising paradigm to harness collective data from many participants, training large models remains a major challenge for participants with limited resources, such as mobile devices. We introduce MSfusion, an effective and efficient collaborative learning framework tailored for training larger models on resource-constrained machines through model splitting. Specifically, a double shifting model splitting scheme is designed such that in each training round, each participant is assigned a subset of model parameters to train over local data, and aggregates with the sub-models of other peers on common parameters. While model splitting significantly reduces the computation and communication costs of individual participants, additional novel designs on adaptive model overlapping and contrastive loss functions help MSfusion maintain training effectiveness against model shift across participants. Extensive experiments on image and NLP tasks illustrate significant advantages of MSfusion in performance and efficiency for training large models, and its strong scalability: the computation cost of each participant decreases significantly as the number of participants increases.
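
The splitting-and-aggregation idea in the abstract can be illustrated with a toy sketch. This is not the paper's double shifting scheme, whose exact index rule is not given here. It assumes, purely for illustration, that each participant trains a contiguous window of a flat parameter vector, that window starts are spread evenly across participants and advance by a fixed shift each round, and that each participant averages its values with peers only on the indices they share. All function names and parameters below are hypothetical.

```python
def window(n_params, start, size):
    """Indices of a contiguous window of `size` parameters, wrapping around."""
    return [(start + k) % n_params for k in range(size)]


def assign_windows(n_params, n_participants, frac, round_idx, shift=1):
    """Hypothetical splitting rule (an assumption, not the paper's DSS).

    Each participant gets a window holding a fraction `frac` of the
    parameters. Starts are spread evenly over the participants and move
    forward by `shift` positions every round.
    """
    size = max(1, round(frac * n_params))
    stride = max(1, n_params // n_participants)
    return [window(n_params, i * stride + round_idx * shift, size)
            for i in range(n_participants)]


def aggregate_on_overlap(own, peers):
    """Average a participant's values with peers on shared indices only.

    `own` and each entry of `peers` map parameter index -> value.
    Indices that no peer holds keep the participant's own value.
    """
    merged = {}
    for idx, value in own.items():
        shared = [value] + [p[idx] for p in peers if idx in p]
        merged[idx] = sum(shared) / len(shared)
    return merged


def covers_all(windows, n_params):
    """True if every parameter index is held by at least one participant."""
    return set().union(*windows) == set(range(n_params))


if __name__ == "__main__":
    windows = assign_windows(n_params=8, n_participants=4, frac=0.5, round_idx=0)
    # Toy stand-in for local training: participant i sets its parameters to float(i).
    submodels = [{idx: float(i) for idx in w} for i, w in enumerate(windows)]
    print(aggregate_on_overlap(submodels[0], submodels[1:]))
    print(assign_windows(8, 4, 0.5, round_idx=1)[0])
    print(covers_all(assign_windows(8, 8, 0.25, 0), 8))
```

Under these toy assumptions, eight participants each holding a quarter of the parameters still cover the whole vector, while two participants with the same fraction do not. This gives some intuition for why the per-participant share can shrink as more peers join; it does not reproduce any result from the paper.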

Paper Structure

This paper contains 19 sections, 1 theorem, 29 equations, 9 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4

[shulgin2023towards] Consider a distributed learning setting with the learning process shown in (com_grad) for a quadratic problem (quadratic) with $\overline{\mathbf{L}} \succ 0$ and $b_i \equiv 0$. Then for $\overline{A} := \frac{1}{2}\mathbb{E}[\overline{\mathbf{L}}\,\overline{\mathbf{B}}^r + \dots]$ (definition truncated in the source) and for a step size $\eta$ with $0 < \eta < \frac{1}{\xi}$, the iterates satisfy the following: [equation not shown] and where [equation not shown].

Figures (9)

  • Figure 1: Overview of MSfusion.
  • Figure 2: Difference between DSS and previous model splitting schemes.
  • Figure 3: Performance comparisons on non-IID CIFAR100, WikiText2 and WikiText103 datasets.
  • Figure 4: Performance of MSfusion for different numbers of participants.
  • Figure 5: (a) Illustrations of ring and fully-connected network topology; (b) Performance of MSfusion under ring and fully-connected topology; (c) Performance comparison under ring topology (MSfusion with $\mu = 25\%$ for all participants).
  • ...and 4 more figures

Theorems & Definitions (2)

  • Definition 1
  • Theorem 4