Deep Reinforcement Learning based Online Scheduling Policy for Deep Neural Network Multi-Tenant Multi-Accelerator Systems

Francesco G. Blanco, Enrico Russo, Maurizio Palesi, Davide Patti, Giuseppe Ascia, Vincenzo Catania

TL;DR

The paper addresses online scheduling of multi-tenant DNN inferences on heterogeneous multi-accelerator systems to meet strict SLA deadlines. It proposes RELMAS, a DRL-based scheduler that integrates an LSTM policy with Deep Deterministic Policy Gradient (DDPG), processing per-layer information including latencies $c^m_{i,s}$ and bandwidths $b^m_{i,s}$, and operating in periodic cycles of length $T_s$. By encoding a rich state and shaping rewards to balance deadline adherence with bandwidth efficiency, RELMAS achieves substantial SLA improvements (up to $173\%$ in some scenarios) with energy overhead under $1.5\%$, outperforming static and heuristic baselines across diverse workloads. The approach demonstrates practical viability for cloud-based, multi-tenant DNN inference, enabling better SLA satisfaction and hardware utilization in mixed-workload environments.
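To make the idea of a periodic scheduling cycle concrete, here is a minimal sketch of a single decision step that reads per-layer latencies and bandwidth demands and maps pending layers onto free sub-accelerators. It is not the RELMAS policy: the LSTM/DDPG agent is replaced by a simple greedy stand-in (earliest deadline first, then lowest latency under a shared bandwidth budget). The names `Layer`, `schedule_cycle`, and `bw_budget`, as well as the units, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Layer:
    """Next layer of a pending request (hypothetical structure)."""
    job: str                      # tenant request this layer belongs to
    deadline: float               # SLA deadline of the request
    latency: Dict[str, float]     # accelerator name -> latency (plays the role of c)
    bandwidth: Dict[str, float]   # accelerator name -> bandwidth demand (plays the role of b)


def schedule_cycle(pending: List[Layer], free_accels: List[str],
                   bw_budget: float) -> Dict[str, str]:
    """Run one scheduling cycle and return a job -> accelerator mapping.

    Greedy stand-in for a learned policy: serve the most urgent layer first
    and give it the fastest free accelerator whose bandwidth demand still
    fits in the remaining shared memory bandwidth.
    """
    assignments: Dict[str, str] = {}
    free = list(free_accels)
    remaining = bw_budget
    for layer in sorted(pending, key=lambda l: l.deadline):
        candidates = [a for a in free if layer.bandwidth[a] <= remaining]
        if not candidates:
            continue  # layer waits for the next cycle
        best = min(candidates, key=lambda a: layer.latency[a])
        assignments[layer.job] = best
        free.remove(best)
        remaining -= layer.bandwidth[best]
    return assignments


pending = [
    Layer("A", 10.0, {"simba": 2.0, "eyeriss": 3.0}, {"simba": 4.0, "eyeriss": 2.0}),
    Layer("B", 5.0, {"simba": 1.0, "eyeriss": 4.0}, {"simba": 5.0, "eyeriss": 1.0}),
]
tight = schedule_cycle(pending, ["simba", "eyeriss"], bw_budget=6.0)
loose = schedule_cycle(pending, ["simba", "eyeriss"], bw_budget=8.0)
```

With the tight budget only request B is placed, because its bandwidth demand leaves too little for A. A learned policy would aim to weigh such trade-offs across cycles rather than greedily within one.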

Abstract

There is a growing trend of outsourcing the execution of DNNs to cloud services. For service providers, managing multi-tenancy and ensuring high-quality service delivery, particularly meeting stringent execution time constraints, is of paramount importance, and it must be achieved while remaining cost-effective. In this context, heterogeneous multi-accelerator systems become increasingly relevant. This paper presents RELMAS, a low-overhead deep reinforcement learning algorithm designed for the online scheduling of DNNs in multi-tenant environments, taking into account the dataflow heterogeneity of accelerators and memory bandwidth contention. With it, service providers can employ the most efficient scheduling policy for user requests, optimizing Service-Level-Agreement (SLA) satisfaction rates and enhancing hardware utilization. Applying RELMAS to a heterogeneous multi-accelerator system composed of various instances of Simba and Eyeriss sub-accelerators resulted in up to a 173% improvement in SLA satisfaction rate compared to state-of-the-art scheduling techniques across different workload scenarios, with less than a 1.5% energy overhead.
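As a reading aid for the headline numbers, the sketch below shows one plausible way to compute an SLA satisfaction rate and a relative improvement over a baseline. It assumes the rate is the fraction of requests that finish by their deadline and that the improvement is measured relative to the baseline rate; the summary does not spell out the exact definitions, and the function names are hypothetical.

```python
from typing import Sequence


def sla_satisfaction_rate(completion_times: Sequence[float],
                          deadlines: Sequence[float]) -> float:
    """Fraction of requests whose completion time is within the deadline."""
    if len(completion_times) != len(deadlines):
        raise ValueError("one deadline is required per request")
    if not deadlines:
        raise ValueError("at least one request is required")
    met = sum(1 for t, d in zip(completion_times, deadlines) if t <= d)
    return met / len(deadlines)


def relative_improvement(new_rate: float, baseline_rate: float) -> float:
    """Improvement of new_rate over baseline_rate, in percent of the baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


rate = sla_satisfaction_rate([1.0, 5.0, 9.0], [2.0, 4.0, 10.0])  # 2 of 3 met
gain = relative_improvement(0.75, 0.25)
```

Under these assumptions, a 173% improvement means the new satisfaction rate is 2.73 times the baseline rate, so large relative gains are most likely where the baseline satisfies few requests.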

Paper Structure

This paper contains 11 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Diagram of the reference Multi-Accelerator Heterogeneous Architecture used in evaluations.
  • Figure 2: Overview of the proposed online scheduler in production and learning phases.
  • Figure 3: SLA Satisfaction Rate comparison against other baselines for different workload sets.
  • Figure 4: Impact of memory bandwidth reduction on SLA Satisfaction Rate for different scheduling strategies.
  • Figure 5: Energy overhead of the proposed scheduling algorithm varying the hidden size of the LSTM policy and the scheduling period.