PiPar: Pipeline Parallelism for Collaborative Machine Learning

Zihan Zhang; Philip Rodgers; Peter Kilpatrick; Ivor Spence; Blesson Varghese

TL;DR

PiPar tackles under-utilization in privacy-preserving collaborative learning by introducing pipeline parallelism that splits DNNs across device and server, reorders training stages, and overlaps computation with communication. It combines three pipeline construction phases—DNN splitting, stage reordering, and multi-device parallelization—with an automated parameter selection (APS) mechanism to choose split points $P$ and parallel batch counts $N$. Theoretical analysis shows that splitting and reordering do not degrade convergence or accuracy, and empirical results demonstrate up to 34.6x faster training and up to 64.1x reduction in idle time compared to FL, plus robust performance under heterogeneous devices and differential privacy. APS reduces the need for exhaustive search, delivering near-optimal parameter choices with negligible overhead. Overall, PiPar provides a practical, scalable approach for accelerating privacy-preserving CML in real-world, bandwidth-variant environments.

Abstract

Collaborative machine learning (CML) techniques, such as federated learning, have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques are privacy-preserving because each device shares a locally trained model with the server rather than its raw data. However, CML training is inefficient due to low resource utilization. We identify idle resources on the server and devices, caused by sequential computation and communication, as the principal cause of this low utilization. We develop PiPar, a novel framework that leverages pipeline parallelism for CML techniques to substantially improve resource utilization. A new training pipeline is designed to parallelize the computations on different hardware resources and the communication on different bandwidth resources, thereby accelerating the training process in CML. A low-overhead automated parameter selection method is proposed to optimize the pipeline, maximizing the utilization of available resources. The experimental results confirm the validity of the underlying approach of PiPar and highlight that, when compared to federated learning: (i) the idle time of the server can be reduced by up to 64.1x, and (ii) the overall training time can be reduced by up to 34.6x under varying network conditions for a collection of six small and large popular deep neural networks and four datasets, without sacrificing accuracy. It is also experimentally demonstrated that PiPar achieves performance benefits when incorporating differential privacy methods and when operating in environments with heterogeneous devices and changing bandwidths.
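The claim that sequential computation and communication leave resources idle can be illustrated with a toy timing model. The sketch below is not PiPar's implementation: the stage names, the assignment of stages to resources, and the durations are hypothetical values chosen for illustration, not measurements from the paper. It compares the time of one fully sequential iteration with a best-case lower bound for perfectly overlapped execution, where throughput is limited only by the busiest resource.

```python
# Toy timing model (illustrative assumption, not PiPar's code or data).
# Each stage maps to (resource it occupies, duration in milliseconds).
# All names and numbers below are hypothetical.
STAGES = {
    "device_forward": ("device", 40),
    "upload": ("uplink", 100),
    "server_forward_backward": ("server", 20),
    "download": ("downlink", 100),
    "device_backward": ("device", 60),
}


def resource_load(stages):
    """Total busy time per resource for one batch."""
    load = {}
    for resource, duration in stages.values():
        load[resource] = load.get(resource, 0) + duration
    return load


def sequential_time(stages):
    """Time per batch when every stage waits for the previous one."""
    return sum(duration for _, duration in stages.values())


def overlapped_lower_bound(stages):
    """Best-case steady-state time per batch with perfect overlap:
    no schedule can beat the load of the busiest resource."""
    return max(resource_load(stages).values())


def server_idle_fraction(stages, time_per_batch):
    """Fraction of each batch interval during which the server is idle."""
    return 1.0 - resource_load(stages)["server"] / time_per_batch


if __name__ == "__main__":
    seq = sequential_time(STAGES)
    ovl = overlapped_lower_bound(STAGES)
    print("sequential ms per batch:", seq)
    print("overlapped lower bound ms per batch:", ovl)
    print("server idle (sequential):", server_idle_fraction(STAGES, seq))
    print("server idle (overlapped bound):", server_idle_fraction(STAGES, ovl))
```

With these made-up numbers the bottleneck is a communication link, so overlap shortens the per-batch time but the server still idles for most of it. This is only a bound under idealized assumptions; a real schedule also depends on dependencies between stages and on how many batches are in flight.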

Paper Structure (32 sections, 28 equations, 14 figures, 8 tables, 2 algorithms)

Introduction
Background and Related Work
Background
Federated learning
Split learning
Split federated learning
Related work
Improving resource utilization using pipeline parallelism
Reducing the impact of stragglers
Reducing communication overhead
PiPar
Motivation
Pipeline construction
Automated parameter selection
Convergence analysis
...and 17 more sections

Figures (14)

Figure 1: Training of CML methods, assuming $K$ devices. The training steps (circled numbers) are explained in Section \ref{subsec:bg}.
Figure 2: Pipelines for one training iteration in conventional training and PiPar when using a split DNN. "Comp" is an abbreviation for computation. $f$, $b$, $u$ and $d$ represent forward pass, backward pass, upload and download, respectively. Superscripts indicate server-side ($s$) or client-side ($c$) computation or communication.
Figure 3: PiPar using single and multiple devices. Comp, $f$, $b$, $u$ and $d$ represent computation, forward pass, backward pass, upload and download, respectively. The superscripts $s_k$ and $c_k$ represent the index of the model $M^{s_k}$ and $M^{c_k}$, $k=1,2$, respectively.
Figure 4: Training time per epoch for FL, SFL and PiPar under different network conditions for small DNNs.
Figure 5: Training time per epoch for SFL and PiPar under different network conditions for large DNNs. FL results are not shown as the entire DNN does not fit on the device memory.
...and 9 more figures
