Unity is Power: Semi-Asynchronous Collaborative Training of Large-Scale Models with Structured Pruning in Resource-Limited Clients

Yan Li, Xiao Zhang, Mingyi Li, Guangwei Xu, Feng Chen, Yuan Yuan, Yifei Zou, Mengying Zhao, Jianbo Lu, Dongxiao Yu

TL;DR

This paper tackles the challenge of training very large models across many resource-limited devices with heterogeneous data. It introduces $Co\text{-}S^2P$, a semi-asynchronous collaborative framework that combines data distribution-aware structured pruning, cross-block knowledge transfer via self-distillation, and a semi-asynchronous aggregation strategy to mitigate stragglers. The authors provide a convergence analysis showing a rate of $O(1/\sqrt{N^{*}EQ})$ under standard assumptions and demonstrate substantial practical gains on real IoT hardware, including up to 8.8% improvements in server accuracy and reductions in memory and training time. The approach generalizes across vision and NLP tasks and scales to large models, indicating strong potential for resource-constrained, distributed training environments.

Abstract

In this work, we study how to unlock the potential of massive, heterogeneous, weak computing power to collaboratively train large-scale models on dispersed datasets. To improve both efficiency and accuracy in resource-adaptive collaborative learning, we take the first step toward addressing the \textit{unstructured pruning}, \textit{varying submodel architectures}, \textit{knowledge loss}, and \textit{straggler} challenges simultaneously. We propose a novel semi-asynchronous collaborative training framework, namely ${Co\text{-}S}^2{P}$, with data distribution-aware structured pruning and a cross-block knowledge transfer mechanism to address these concerns. Furthermore, we provide a theoretical proof that ${Co\text{-}S}^2{P}$ achieves an asymptotically optimal convergence rate of $O(1/\sqrt{N^*EQ})$. Finally, we conduct extensive experiments on two types of tasks with a real-world hardware testbed that includes diverse IoT devices. The experimental results demonstrate that $Co\text{-}S^2P$ improves accuracy by up to 8.8\% and resource utilization by up to 1.2$\times$ compared to state-of-the-art methods, while reducing memory consumption by approximately 22\% and training time by about 24\% on all resource-limited devices.

Paper Structure

This paper contains 15 sections, 2 theorems, 29 equations, 12 figures, 11 tables, 2 algorithms.

Key Result

Theorem 1

Let all assumptions hold. Suppose that the step size $\eta$ satisfies the following relationships: Therefore, the step size $\eta$ is defined as: $0\leq \eta \leq \frac{1}{4LE}.$ Then, for all $Q\geq 1$, we have: where $\frac{1}{Q}\sum_{q=1}^Q(\tau_q)^2=\tau$, $|K|$ is the number of segments, and $N^*=\min_{q,i}|N_q^i|$ denotes the minimum number of submodels training the corresponding segment …
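
As a rough illustration of how the quantities named in Theorem 1 interact, the sketch below computes $N^*$ from a table of per-round, per-segment submodel counts and evaluates the leading term $1/\sqrt{N^*EQ}$ of the stated rate. The function names and the `counts[q][i]` layout are hypothetical. Constants hidden by the $O(\cdot)$ notation are ignored, $Q$ is assumed to be the number of communication rounds, and $E$ is assumed to be the number of local training steps per round, neither of which this excerpt defines explicitly.

```python
import math
from typing import List


def min_segment_coverage(counts: List[List[int]]) -> int:
    """Return N* = min over rounds q and segments i of |N_q^i|.

    counts[q][i] is assumed to hold the number of submodels that
    trained segment i in round q (hypothetical layout).
    """
    return min(min(row) for row in counts)


def rate_term(n_star: int, local_steps: int, rounds: int) -> float:
    """Leading term 1 / sqrt(N* * E * Q), with all constants dropped."""
    if min(n_star, local_steps, rounds) < 1:
        raise ValueError("N*, E and Q must all be at least 1")
    return 1.0 / math.sqrt(n_star * local_steps * rounds)


# Two rounds, three segments: segment 1 is always the least covered.
counts = [[3, 2, 4],
          [5, 2, 3]]
n_star = min_segment_coverage(counts)
print(n_star, rate_term(n_star, local_steps=2, rounds=2))
```

Under these assumptions, quadrupling the number of rounds halves the term, and the same holds for quadrupling $N^*$, which is why the least-covered segment governs the bound.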

Figures (12)

  • Figure 1: Unstructured pruning fails to train the split submodels due to memory constraints on real-world resource-limited clients.
  • Figure 2: Overview of $Co\text{-}S^2P$. ① The server prunes blocks in the depth dimension from top to bottom and freezes the shallow blocks from bottom to top according to the available resources of the clients. ② The clients train structured width-based masks on their local datasets. ③ The clients train structured-pruned submodels using self-distillation to implement cross-block knowledge transfer. ④ A semi-asynchronous aggregation strategy mitigates the straggler problem.
  • Figure 3: Details of data distribution-aware balanced structured pruning in both the depth and width dimensions. In the depth dimension, the server prunes the Transformer blocks top-to-bottom and freezes the blocks bottom-to-top. Subsequently, the heterogeneous clients prune the depth-pruned submodel by training width-based structured masks on the local data distribution.
  • Figure 4: Cross-block knowledge transfer. In structured-pruned submodel 1, the high-level knowledge transfers from block $B_3$ to blocks $B_1$ and $B_2$ through self-distillation. Through the segment-wise weight aggregation strategy, the high-level knowledge is further transferred among blocks at the same location.
  • Figure 5: Overview of the main testbed platform, including 4 types of Jetson devices as the clients and an RTX 3090 GPU as the server.
  • ...and 7 more figures
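
The depth-wise pruning and freezing described for Figure 3 and the segment-wise aggregation mentioned for Figure 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `depth_plan` and `aggregate_by_block` are hypothetical, "top" is read as the deepest blocks and "bottom" as the shallowest, each block is treated as its own segment, and plain unweighted averaging is assumed. Width-based masks, self-distillation, and the semi-asynchronous schedule are left out.

```python
from typing import Dict, List


def depth_plan(num_blocks: int, num_pruned: int, num_frozen: int) -> Dict[str, List[int]]:
    """Split block indices 0..num_blocks-1 (0 = shallowest) for one client.

    The deepest `num_pruned` blocks are dropped (top-to-bottom) and the
    shallowest `num_frozen` blocks are frozen (bottom-to-top); the
    remaining middle blocks are the ones the client actually trains.
    """
    if num_pruned < 0 or num_frozen < 0 or num_pruned + num_frozen > num_blocks:
        raise ValueError("pruned + frozen blocks must fit within the model depth")
    kept = num_blocks - num_pruned
    return {
        "frozen": list(range(num_frozen)),
        "trainable": list(range(num_frozen, kept)),
        "pruned": list(range(kept, num_blocks)),
    }


def aggregate_by_block(updates: List[Dict[int, List[float]]]) -> Dict[int, List[float]]:
    """Average each block's weights over only the clients that trained it.

    Each update maps a block index to that client's flattened weights for
    the block (assumed layout). Blocks nobody trained are simply absent.
    """
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for update in updates:
        for block, weights in update.items():
            if block not in sums:
                sums[block] = [0.0] * len(weights)
                counts[block] = 0
            sums[block] = [s + w for s, w in zip(sums[block], weights)]
            counts[block] += 1
    return {b: [s / counts[b] for s in sums[b]] for b in sums}


# A weak client on a 6-block model: drop the 2 deepest, freeze the shallowest.
plan = depth_plan(num_blocks=6, num_pruned=2, num_frozen=1)
print(plan)

# Block 1 was trained by two clients, block 2 by only one of them.
merged = aggregate_by_block([{1: [1.0, 3.0], 2: [2.0, 2.0]}, {1: [3.0, 5.0]}])
print(merged)
```

Averaging per block rather than per whole model is what lets submodels of different depths contribute to the same global model: a block is updated by whichever clients happened to train it.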

Theorems & Definitions (9)

  • Definition 1
  • Remark 1
  • Theorem 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Corollary 1
  • Remark 6