Unleashing the Power of Preemptive Priority-based Scheduling for Real-Time GPU Tasks

Yidi Wang, Cong Liu, Daniel Wong, Hyoseung Kim

TL;DR

This work tackles the challenge of guaranteeing real-time performance for GPU-using tasks by introducing two driver-level approaches, kernel-thread-based and IOCTL-based, to enable preemptive, priority-driven GPU scheduling on Nvidia Tegra GPUs. It provides formal end-to-end response-time analyses for both methods, including an Audsley-based GPU priority assignment mechanism and a reduced-pessimism enhancement that accounts for overlaps between CPU and GPU segments. Through extensive simulations and real-hardware case studies on Jetson Xavier NX and Orin Nano, the authors demonstrate substantial schedulability improvements (up to ~40%) and better predictability than traditional synchronization-based strategies. The results underscore the practical value of device-driver-level control for real-time GPU tasks on embedded platforms, and the work is released as open source for broader adoption and further refinement.
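The TL;DR mentions an Audsley-based GPU priority assignment. Audsley's optimal priority assignment (OPA) is a generic procedure: it fills priority levels from the lowest upward, giving each level to any task that passes a schedulability test when all still-unassigned tasks are assumed to have higher priority. The sketch below assumes a pluggable per-task test; as a stand-in it uses the classic uniprocessor fixed-priority response-time test for independent periodic tasks, not the paper's GPU-aware analysis. The names `Task`, `rta_schedulable`, and `audsley_opa`, and the period/deadline fields `T` and `D`, are hypothetical.

```python
from dataclasses import dataclass
from math import ceil
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical periodic task: WCET C, period T, relative deadline D."""
    name: str
    C: int
    T: int
    D: int


def rta_schedulable(task: Task, higher: Sequence[Task]) -> bool:
    """Classic uniprocessor fixed-priority response-time test (stand-in only).

    Iterates R = C + sum(ceil(R / T_h) * C_h) over higher-priority tasks
    until R converges (schedulable) or exceeds D (not schedulable).
    """
    r = task.C
    while True:
        r_next = task.C + sum(ceil(r / h.T) * h.C for h in higher)
        if r_next > task.D:
            return False
        if r_next == r:
            return True
        r = r_next


def audsley_opa(
    tasks: Sequence[Task], test: Callable[[Task, Sequence[Task]], bool]
) -> Optional[List[Task]]:
    """Audsley's optimal priority assignment.

    Fills priority levels from lowest to highest. At each level, the first
    unassigned task that passes `test`, with all other unassigned tasks
    assumed to have higher priority, takes that level. Returns the tasks
    ordered highest priority first, or None if some level cannot be filled.
    """
    unassigned = list(tasks)
    lowest_first: List[Task] = []
    while unassigned:
        for cand in unassigned:
            others = [t for t in unassigned if t is not cand]
            if test(cand, others):
                lowest_first.append(cand)
                unassigned = others  # rebinding is safe: we break right away
                break
        else:
            return None  # no remaining task can take this priority level
    return lowest_first[::-1]


if __name__ == "__main__":
    demo = [Task("a", 1, 4, 4), Task("b", 2, 6, 6), Task("c", 3, 12, 12)]
    order = audsley_opa(demo, rta_schedulable)
    print([t.name for t in order] if order is not None else None)  # ['b', 'a', 'c']
```

In this toy set, rate-monotonic order (a, b, c) would also pass; OPA only promises to find some passing order whenever one exists, provided the plugged-in test is compatible with OPA.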

Abstract

Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much research has been conducted in the real-time research community, several limitations persist, including the absence or limited availability of preemption, extended blocking times, and/or the need for extensive modifications to program code. In this paper, we propose two novel techniques, namely the kernel thread-based and IOCTL-based approaches, to enable preemptive priority-based scheduling for real-time GPU tasks. Our approaches exert control over GPU context scheduling at the device driver level and enable preemptive GPU scheduling based on task priorities. The kernel thread-based approach achieves this without requiring modifications to user-level programs, while the IOCTL-based approach needs only a single macro at the boundaries of GPU access segments. In addition, we provide a comprehensive response time analysis that takes into account overlaps between different task segments, mitigating pessimism in worst-case estimates. Through empirical evaluations and case studies, we demonstrate the effectiveness of the proposed approaches in improving taskset schedulability and the timeliness of real-time tasks. The results show significant improvements over prior work, with up to 40% higher schedulability, while also achieving predictable worst-case behavior on Nvidia Jetson embedded platforms.
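The abstract says both approaches control GPU context scheduling in the driver so that GPU access follows task priorities, with the IOCTL-based variant hooking the boundaries of GPU access segments. The toy model below illustrates that idea under a simplifying assumption of ours: the runlist is reduced to the single highest-priority task currently inside a GPU segment and is re-selected at every segment entry and exit. `RunlistSketch`, `gpu_begin`, `gpu_end`, and `replay` are hypothetical names; nothing here models the real Tegra driver runlist or its IOCTL interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class RunlistSketch:
    """Toy model of priority-driven runlist control (hypothetical, not a driver API).

    Assumes a larger number means higher priority and that the runlist holds
    only the highest-priority task currently inside a GPU segment.
    """
    priorities: Dict[str, int]
    active: Set[str] = field(default_factory=set)
    runlist: Optional[str] = None
    updates: int = 0  # each change would cost one runlist update time (epsilon)

    def _refresh(self) -> None:
        best = max(self.active, key=self.priorities.__getitem__, default=None)
        if best != self.runlist:
            self.runlist = best
            self.updates += 1

    def gpu_begin(self, task: str) -> None:
        """Segment-entry hook, where an IOCTL-style macro would sit."""
        self.active.add(task)
        self._refresh()

    def gpu_end(self, task: str) -> None:
        """Segment-exit hook: release the GPU and re-select the runlist."""
        self.active.discard(task)
        self._refresh()


def replay(
    events: List[Tuple[str, str]], priorities: Dict[str, int]
) -> Tuple[List[Optional[str]], int]:
    """Replay ("begin" | "end", task) events; return the runlist after each event and the update count."""
    rl = RunlistSketch(priorities)
    trace: List[Optional[str]] = []
    for kind, task in events:
        if kind == "begin":
            rl.gpu_begin(task)
        else:
            rl.gpu_end(task)
        trace.append(rl.runlist)
    return trace, rl.updates


if __name__ == "__main__":
    prio = {"tau1": 3, "tau2": 2, "tau3": 1}  # tau1 > tau2 > tau3
    events = [("begin", "tau3"), ("begin", "tau1"), ("begin", "tau2"),
              ("end", "tau1"), ("end", "tau2"), ("end", "tau3")]
    print(replay(events, prio))
    # (['tau3', 'tau1', 'tau1', 'tau2', 'tau3', None], 5)
```

In the replay, tau1 preempts tau3's GPU access as soon as it enters its segment, while tau2 waits without triggering an update because the selection does not change.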

Paper Structure (21 sections, 7 theorems, 15 equations, 18 figures, 5 tables, 2 algorithms)

Key Result

Lemma 1

The runlist update delay from the kernel thread for a job of task $\tau_i$ is upper-bounded by an expression in which $\epsilon$ is the runlist update time (see the kernel thread-based approach section), $R_i$ is the worst-case response time of $\tau_i$, $hp(\tau_i)$ is the set of all tasks with higher priority than $\tau_i$, and $J_h=R_h-(C_h+G_h)$ is the release jitter that captures the carry-in effect.
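Because the bound in Lemma 1 depends on $R_i$ and on the higher-priority jitters $J_h=R_h-(C_h+G_h)$, analyses of this kind are usually evaluated by fixed-point iteration, from the highest priority down. The sketch below shows the classic jitter-aware iteration $R_i = W_i + \sum_{h \in hp(\tau_i)} \lceil (R_i + J_h)/T_h \rceil W_h$ under a simplifying assumption of ours, not the paper's: each task's CPU and GPU demand $W = C + G$ runs sequentially on a single resource, with no runlist update delay. The period $T$, deadline $D$, and the names `GpuTask`, `release_jitter`, and `response_times` are hypothetical; the actual Lemma 1 bound and the paper's CPU/GPU analysis are not reproduced.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Optional


@dataclass(frozen=True)
class GpuTask:
    """Hypothetical task: CPU WCET C, GPU WCET G, period T, deadline D."""
    name: str
    C: int
    G: int
    T: int
    D: int


def release_jitter(R: int, t: GpuTask) -> int:
    """J_h = R_h - (C_h + G_h), as defined in Lemma 1."""
    return R - (t.C + t.G)


def response_times(tasks_by_priority: List[GpuTask]) -> List[Optional[int]]:
    """Jitter-aware fixed-point RTA, tasks listed highest priority first.

    Simplification: C + G is treated as one sequential workload. Returns
    None for a task whose iteration exceeds its deadline, and for every
    lower-priority task, since their jitter terms would be undefined.
    """
    results: List[Optional[int]] = []
    jitters: List[int] = []  # J_h of each already-analyzed higher-priority task
    for i, task in enumerate(tasks_by_priority):
        if results and results[-1] is None:
            results.append(None)
            continue
        w = task.C + task.G
        r = w
        while True:
            # Carry-in: up to ceil((R_i + J_h) / T_h) jobs of each hp task interfere.
            r_next = w + sum(
                ceil((r + jitters[h]) / hp.T) * (hp.C + hp.G)
                for h, hp in enumerate(tasks_by_priority[:i])
            )
            if r_next > task.D:
                results.append(None)
                break
            if r_next == r:
                results.append(r)
                jitters.append(release_jitter(r, task))
                break
            r = r_next
    return results


if __name__ == "__main__":
    tasks = [GpuTask("t1", 1, 1, 5, 5), GpuTask("t2", 1, 2, 10, 10),
             GpuTask("t3", 2, 2, 20, 20)]
    print(response_times(tasks))  # [2, 5, 18]
```

Here t2 converges at $R_2 = 5$, so its jitter $J_2 = 5 - 3 = 2$ widens the window in which t2's jobs can interfere with t3.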

Figures (18)

  • Figure 1: Runlist and time-sliced GPU scheduling
  • Figure 2: Task model example
  • Figure 3: Example schedule of three tasks under different approaches (priority $\tau_1 > \tau_2 > \tau_3$)
  • Figure 4: Example schedule of three tasks with runlist update delay (task priority: $\tau_1 > \tau_2 > \tau_3$)
  • Figure 5: Preemption of CPU tasks by GPU segments under busy-waiting mode, using the kernel thread approach as an example

Theorems & Definitions (16)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Definition 1: Completion time
  • Definition 2: Full overlap