MP-SL: Multihop Parallel Split Learning

Joana Tirana, Spyros Lalis, Dimitris Chatzopoulos

TL;DR

MP-SL addresses the memory and heterogeneity challenges of federated and split learning with a multihop, pipelined split learning framework that partitions the model into P parts across N compute nodes, so that resource-constrained data owners can participate. It couples ILP-based split-point optimization with a formal training cost model to minimize pipeline latency, and preserves privacy through no-label sharing. Empirical results show memory reductions of up to $76\%$ per compute node, epoch-time estimation errors below $3.86\%$, and substantial cost savings when using cheaper compute nodes, along with robustness to stragglers and network heterogeneity. The framework is modular and offered as an MLaaS solution; future work includes combining pipeline parallelism with horizontal scaling to further accelerate large-scale deployments.

Abstract

Federated Learning (FL) stands out as a widely adopted protocol facilitating the training of Machine Learning (ML) models while maintaining decentralized data. However, challenges arise when dealing with a heterogeneous set of participating devices, causing delays in the training process, particularly among devices with limited resources. Moreover, the task of training ML models with a vast number of parameters demands computing and memory resources beyond the capabilities of small devices, such as mobile and Internet of Things (IoT) devices. To address these issues, techniques like Parallel Split Learning (SL) have been introduced, allowing multiple resource-constrained devices to actively participate in collaborative training processes with assistance from resourceful compute nodes. Nonetheless, a drawback of Parallel SL is the substantial memory allocation required at the compute nodes; for instance, training VGG-19 with 100 participants requires 80 GB. In this paper, we introduce Multihop Parallel SL (MP-SL), a modular and extensible ML as a Service (MLaaS) framework designed to facilitate the involvement of resource-constrained devices in collaborative and distributed ML model training. Notably, to alleviate memory demands per compute node, MP-SL supports multihop Parallel SL-based training. This involves splitting the model into multiple parts and utilizing multiple compute nodes in a pipelined manner. Extensive experimentation validates MP-SL's capability to handle system heterogeneity, demonstrating that the multihop configuration proves more efficient than horizontally scaled one-hop Parallel SL setups, especially in scenarios involving more cost-effective compute nodes.
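
To make the multihop idea concrete, the following is a minimal sketch of how spreading the intermediate layers over more compute nodes lowers the peak memory per node. It is not the paper's ILP formulation, which targets pipeline latency; it swaps in a plain brute-force search that minimizes the largest part's memory. The function name `split_middle`, the per-layer memory values, and the chosen first and last cut layers are all hypothetical and serve only as an illustration. The sketch assumes the layers before the first cut and after the last cut stay on the data owner.

```python
from itertools import combinations


def split_middle(layer_mem, first_cut, last_cut, num_nodes):
    """Split layer_mem[first_cut:last_cut] into num_nodes contiguous parts.

    Returns (cut_points, peak), where cut_points are absolute layer indices
    at which a new part starts and peak is the memory of the largest part.
    Brute force over all cut placements; fine for a few dozen layers.
    """
    middle = layer_mem[first_cut:last_cut]
    if not 1 <= num_nodes <= len(middle):
        raise ValueError("need between 1 and len(middle) compute nodes")

    best = None
    # Choose num_nodes - 1 interior boundaries inside the middle segment.
    for cuts in combinations(range(1, len(middle)), num_nodes - 1):
        bounds = (0, *cuts, len(middle))
        peak = max(sum(middle[a:b]) for a, b in zip(bounds, bounds[1:]))
        if best is None or peak < best[1]:
            best = ([first_cut + c for c in cuts], peak)
    return best


# Hypothetical per-layer memory footprints in MB (not measured values).
layer_mem = [4, 8, 8, 16, 16, 32, 32, 16, 8, 2]

if __name__ == "__main__":
    for nodes in (1, 2, 4):
        cuts, peak = split_middle(layer_mem, first_cut=1, last_cut=9,
                                  num_nodes=nodes)
        print(f"{nodes} compute node(s): cuts at {cuts}, peak {peak} MB")
```

With more compute nodes the largest part shrinks, which is the effect the multihop configuration relies on; a latency-aware split would additionally have to account for compute speed and link bandwidth.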

Paper Structure (19 sections, 16 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Memory usage measured on a Raspberry Pi 4, when training ResNet-101 and VGG-19, with FL and MP-SL.
  • Figure 2: Applying SL requires (a) a model split and (b) communication with at least one compute node. When more than one data owner participates, (c) Parallel SL can ensure scalability by allowing data owners to make model updates independently.
  • Figure 3: Memory demand for the compute node with the largest (memory-wise) model part for different multihop levels. The smallest multihop level is $3$ (i.e., one compute node), and the largest is $6$ (i.e., four compute nodes). Also, for each model, we select different user-defined first and last cut layers. Note that VGG-19 has $25$ indivisible layers while ResNet-101 has $37$.
  • Figure 4: MP-SL protocol with two compute nodes.
  • Figure 5: Task serialization and message encapsulation.
  • ...and 9 more figures