Venn: Resource Management for Collaborative Learning Jobs

Jiachen Liu, Fan Lai, Ding Ding, Yiwen Zhang, Mosharaf Chowdhury

TL;DR

Venn tackles resource contention among collaborative learning (CL) jobs running on ephemeral, heterogeneous edge devices. It formulates the Intersection Resource Scheduling (IRS) problem and applies a contention-aware scheduling heuristic, complemented by a tier-based device-to-job matching strategy that reduces tail response times. On real-world CL workloads, Venn improves average job completion time (JCT) by up to 1.88× over state-of-the-art baselines, while handling dynamic resource supply and preventing starvation. The approach targets scalable, robust CL resource management for large-scale edge deployments, and the code is open source.

Abstract

In recent years, collaborative learning (CL) has emerged as a promising approach for machine learning (ML) and data science across distributed edge devices. As the deployment of CL jobs increases, they inevitably contend for limited resources. However, efficient resource scheduling in this context is challenging because of the ephemeral nature and resource heterogeneity of devices, coupled with the overlapping resource requirements of diverse CL jobs. Existing resource managers often assign devices to CL jobs randomly for simplicity and scalability, but this approach compromises job efficiency. In this paper, we present Venn, a CL resource manager that efficiently schedules ephemeral, heterogeneous devices among multiple CL jobs to reduce the average job completion time (JCT). Venn formulates the Intersection Resource Scheduling (IRS) problem to identify complex resource contention among multiple CL jobs. It then proposes a contention-aware scheduling heuristic to minimize the average scheduling delay. Furthermore, it proposes a resource-aware device-to-job matching heuristic to optimize response collection time by mitigating stragglers. Our evaluation shows that, compared to the state-of-the-art CL resource managers, Venn improves the average JCT by up to 1.88×. The code is available at https://github.com/SymbioticLab/Venn.

Paper Structure

This paper contains 42 sections, 2 theorems, 5 equations, 15 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

Given a diverse set of CL jobs with one round request, if jobs are scheduled optimally in terms of the average JCT, first within each job group and then across job groups, the resulting average JCT is optimal.
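
The within-group step of the lemma can be illustrated with a small brute-force check. The sketch below assumes a single job group whose jobs all draw from the same device pool, with devices arriving at a constant rate, so a job finishes once the cumulative demand up to and including it has been served. The demands and the helper name average_jct are hypothetical and chosen only for illustration. Under these assumptions, serving jobs in order of increasing demand minimizes the average JCT. The cross-group step of the lemma is not covered here.

```python
from itertools import accumulate, permutations


def average_jct(demands_in_order, rate=1.0):
    """Average completion time when jobs are served one after another.

    Assumes devices arrive at a constant `rate` and every device is
    eligible for every job in the group (hypothetical simplification).
    """
    finish_times = [served / rate for served in accumulate(demands_in_order)]
    return sum(finish_times) / len(finish_times)


# Hypothetical per-job device demands for one job group.
demands = [4, 1, 3, 2]

# Exhaustively search every service order for the lowest average JCT.
best = min(permutations(demands), key=average_jct)

# Smallest-demand-first order.
sjf = tuple(sorted(demands))

print("best order by brute force:", best, "avg JCT:", average_jct(best))
print("smallest-demand-first:    ", sjf, "avg JCT:", average_jct(sjf))
```

With these demands, the brute-force search and the smallest-demand-first order coincide, which matches the classic shortest-job-first argument for a single shared resource pool.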

Figures (15)

  • Figure 1: Composition of the completion time of one round of a CL job.
  • Figure 2: CL resources exhibit both high variance in availability and capacity.
  • Figure 3: Toy example showing three resource schedules across multiple CL jobs. Job demands and resource eligibility are shown in the top row. Devices check in at a constant rate. Eligible devices only for Emoji jobs are marked with blue; all devices are eligible for the Keyboard job. The label of each client indicates its job assignment. Random Matching and SRSF inefficiently allocate scarce Emoji-eligible devices to Job 1, which already has sufficient Keyboard-eligible resources. In contrast, the optimal schedule allocates these scarce resources to Job 2 followed by Job 3, minimizing the average JCT.
  • Figure 4: Impact of resource contention.
  • Figure 5: JCT breakdown in a single round.
  • ...and 10 more figures
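
The contention effect described in the Figure 3 caption can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not Venn's scheduling heuristic: the job demands, the alternating device stream, and the scarce_first rule are all hypothetical. One device checks in per time step, Emoji-eligible devices can serve any job, and the remaining devices can serve only the Keyboard job. The sketch compares SRSF (smallest remaining demand first) with a rule that reserves scarce Emoji-eligible devices for the jobs that require them.

```python
# Hypothetical jobs: demand = devices needed; needs_emoji = restricted eligibility.
JOBS = {
    "job1_keyboard": {"demand": 2, "needs_emoji": False},
    "job2_emoji": {"demand": 3, "needs_emoji": True},
    "job3_emoji": {"demand": 4, "needs_emoji": True},
}

# Hypothetical device stream, one device per time step.
# True = Emoji-eligible (can serve any job), False = Keyboard job only.
DEVICES = [True, False] * 8


def srsf(eligible, remaining):
    """Pick the eligible job with the smallest remaining demand."""
    return min(eligible, key=lambda job: remaining[job])


def scarce_first(eligible, remaining):
    """Prefer jobs with restricted eligibility, then smallest remaining demand."""
    return min(
        eligible,
        key=lambda job: (not JOBS[job]["needs_emoji"], remaining[job]),
    )


def simulate(policy):
    """Return the average time step at which each job's demand is met."""
    remaining = {job: spec["demand"] for job, spec in JOBS.items()}
    finish = {}
    for t, is_emoji in enumerate(DEVICES, start=1):
        eligible = [
            job
            for job, left in remaining.items()
            if left > 0 and (is_emoji or not JOBS[job]["needs_emoji"])
        ]
        if not eligible:
            continue  # the device is not usable by any unfinished job
        job = policy(eligible, remaining)
        remaining[job] -= 1
        if remaining[job] == 0:
            finish[job] = t
    if len(finish) != len(JOBS):
        raise ValueError("device stream too short to finish all jobs")
    return sum(finish.values()) / len(finish)


print("SRSF average JCT:        ", simulate(srsf))
print("scarce-first average JCT:", simulate(scarce_first))
```

With these hypothetical numbers, SRSF spends the first Emoji-eligible device on the Keyboard job and ends with an average of 8.0 time steps, while the scarce-first rule ends with about 7.33. The Keyboard job finishes slightly later, but both Emoji jobs finish sooner.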

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Lemma 2
  • proof