Slim Scheduler: A Runtime-Aware RL and Scheduler System for Efficient CNN Inference

Ian Harshbarger, Calvin Chidambaram

TL;DR

The paper tackles runtime scheduling for segmented, slimmable CNNs deployed across heterogeneous GPUs, aiming to improve throughput under fluctuating loads. It introduces Slim Scheduler, a hybrid framework combining a local greedy segment scheduler with a global PPO router that jointly selects device, width, and batching using telemetry-driven rewards. The approach yields large reductions in latency and energy while maintaining competitive accuracy, with trade-offs in latency/energy variance dependent on reward weighting. The results on CIFAR-100 with a SlimResNet backbone show the policy generalizes across hardware, demonstrating practical potential for scalable, resource-aware multi-device inference. Overall, the work demonstrates that integrating algorithmic batching with reinforcement learning can adaptively balance efficiency and robustness in real-time distributed inference.

Abstract

Most neural network scheduling research focuses on optimizing static, end-to-end models of fixed width, overlooking dynamic approaches that adapt to heterogeneous hardware and fluctuating runtime conditions. We present Slim Scheduler, a hybrid scheduling framework that integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. Each server runs a local greedy scheduler that batches compatible requests and manages instance scaling based on VRAM and utilization constraints, while the PPO router learns global routing policies for device selection, width ratio, and batch configuration. This hierarchical design reduces search space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput. Compared to a purely randomized task distribution baseline, Slim Scheduler can achieve a range of accuracy and latency trade-offs. At one extreme, it achieves a 96.45% reduction in mean latency and a 97.31% reduction in energy usage by dropping accuracy to that of the slimmest model available (70.3%). Alternatively, it can reduce average latency and energy consumption overall while increasing accuracy, at the cost of higher standard deviations in latency and energy, which affects overall task throughput.
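To make the local scheduling idea concrete, the following is a minimal sketch of greedy batching of compatible requests under a VRAM budget. It is not the authors' implementation. The `Request` fields, the `greedy_batches` function, the "largest group first" rule, and the per-sample memory costs are all assumptions introduced for illustration. The sketch also assumes that all formed batches are resident in memory at the same time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One inference request for a model segment (all fields are hypothetical)."""
    req_id: int
    segment: int   # index of the model segment to run
    width: float   # slimmable width ratio, e.g. 0.25, 0.5, 1.0


def greedy_batches(requests, max_batch, vram_free_mb, vram_per_sample_mb):
    """Group requests sharing (segment, width) into batches, largest group first.

    vram_per_sample_mb maps a width ratio to an assumed per-sample memory cost.
    Requests that do not fit in the remaining VRAM budget are returned as deferred.
    """
    groups = defaultdict(list)
    for r in requests:
        groups[(r.segment, r.width)].append(r)

    batches, deferred = [], []
    # Greedy choice (an assumption): serve the largest compatible group first.
    for key in sorted(groups, key=lambda k: len(groups[k]), reverse=True):
        pending = groups[key]
        cost = vram_per_sample_mb[key[1]]
        while pending:
            # Batch size is capped by the batch limit, the queue, and free VRAM.
            fit = min(max_batch, len(pending), int(vram_free_mb // cost))
            if fit == 0:
                deferred.extend(pending)
                break
            batches.append((key, pending[:fit]))
            pending = pending[fit:]
            vram_free_mb -= fit * cost
    return batches, deferred


if __name__ == "__main__":
    reqs = [
        Request(0, 0, 0.5), Request(1, 0, 0.5), Request(2, 0, 0.5),
        Request(3, 0, 1.0), Request(4, 0, 1.0),
        Request(5, 1, 0.5),
    ]
    batches, deferred = greedy_batches(reqs, 2, 100, {0.5: 10, 1.0: 40})
    for key, members in batches:
        print(key, [r.req_id for r in members])
    print("deferred:", [r.req_id for r in deferred])
```

In this sketch only requests with the same segment and width ratio share a batch, which is one plausible reading of "compatible". A global router, such as the PPO policy described above, would decide which device and width each request is sent to before a local step like this one runs.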

Paper Structure

This paper contains 12 sections, 13 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: GPU memory utilization vs. batch size for each segment on the RTX 2080 Ti.
  • Figure 2: Energy consumption vs. GPU utilization for each network on the RTX 2080 Ti.
  • Figure 3: Average latency vs. GPU utilization for each segment on the RTX 2080 Ti.