Poster: Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks

Ruikun Wang, Jiawei Zhang, Qiaolun Zhang, Bojun Zhang, Zhiqun Gu, Aryanaz Attarpour, Yuefeng Ji, Massimo Tornatore

TL;DR

Problem addressed: distributed AI tasks require coordinated scheduling of network routes and computing resources to handle large model updates. Approach: a flexible scheduler based on a minimum spanning tree (MST) selects routing paths and aggregation points for model broadcast and upload on a programmable testbed, and is compared against a fixed SPFF baseline. Key contributions: the work shows that MST-based scheduling reduces latency and bandwidth usage when multiple local models are involved, and it highlights open challenges in scheduling, RDMA-based protocols, and all-optical network architectures. Impact: the strategy provides a practical pathway to improve communication efficiency for distributed AI in telecom/cloud environments.
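
To make the tree-based idea concrete, the following is a minimal sketch, not the authors' scheduler. It reads MST as a minimum spanning tree, builds one with Prim's algorithm on a hypothetical four-node topology (server `S`, workers `A`, `B`, `C`; all link weights are invented), and counts link transmissions for a model broadcast. The comparison point is a simplified stand-in for a fixed scheduler that sends one unicast copy per worker along its shortest path; spectrum or slot assignment is not modelled.

```python
import heapq

# Hypothetical topology: node -> {neighbour: link weight}. Values are illustrative only.
GRAPH = {
    "S": {"A": 1, "B": 3},
    "A": {"S": 1, "B": 1, "C": 3},
    "B": {"S": 3, "A": 1, "C": 1},
    "C": {"A": 3, "B": 1},
}

def prim_mst(graph, root):
    """Return minimum-spanning-tree edges as (parent, child, weight), grown from root."""
    visited = {root}
    heap = [(w, root, v) for v, w in graph[root].items()]
    heapq.heapify(heap)
    tree = []
    while heap and len(visited) < len(graph):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        tree.append((u, v, w))
        for nxt, w2 in graph[v].items():
            if nxt not in visited:
                heapq.heappush(heap, (w2, v, nxt))
    return tree

def shortest_path(graph, src, dst):
    """Dijkstra's algorithm; return the node list of a shortest path from src to dst."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

workers = ["A", "B", "C"]

# Fixed-style baseline (assumption): one unicast copy per worker, so shared links are reused.
unicast_hops = sum(len(shortest_path(GRAPH, "S", w)) - 1 for w in workers)

# Tree-based broadcast: each tree link carries the model exactly once.
tree = prim_mst(GRAPH, "S")
tree_hops = len(tree)

print(unicast_hops, tree_hops)  # prints: 6 3
```

On this toy graph the unicast copies cross six links in total while the tree uses three. The same tree could serve the upload direction if intermediate nodes aggregate local models before forwarding, which is the role the TL;DR assigns to aggregation points; how those points are actually chosen is not specified here.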

Abstract

Many emerging Artificial Intelligence (AI) applications require on-demand provisioning of large-scale computing, which can only be enabled by leveraging distributed computing services interconnected through networking. To address such increasing demand for networking to serve AI tasks, we investigate new scheduling strategies to improve communication efficiency and test them on a programmable testbed. We also show relevant challenges and research directions.

Paper Structure (4 sections, 3 figures)

This paper contains 4 sections, 3 figures.

Figures (3)

  • Figure 1: A comparison of fixed and flexible schedulers.
  • Figure 2: Experimental framework and procedures.
  • Figure 3: Evaluation results.