Unleashing the Power of Preemptive Priority-based Scheduling for Real-Time GPU Tasks
Yidi Wang, Cong Liu, Daniel Wong, Hyoseung Kim
TL;DR
This work addresses the challenge of guaranteeing real-time performance for tasks that use GPUs by introducing two driver-level approaches, kernel thread-based and IOCTL-based, that enable preemptive, priority-driven GPU scheduling on Nvidia Tegra GPUs. It provides formal end-to-end response-time analyses for both approaches, together with an Audsley-based GPU priority assignment mechanism and an enhancement that reduces pessimism by accounting for overlaps between CPU and GPU segments. Through extensive simulations and real-hardware case studies on Jetson Xavier NX and Orin Nano, the authors demonstrate schedulability improvements of up to 40% and better predictability than traditional synchronization-based strategies. The results show the practical value of device-driver-level control for real-time GPU tasks on embedded platforms. The work is released as open source.
Abstract
Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much research has been conducted in the real-time research community, several limitations persist, including the absence or limited availability of preemption, extended blocking times, and/or the need for extensive modifications to program code. In this paper, we propose two novel techniques, namely the kernel thread and IOCTL-based approaches, to enable preemptive priority-based scheduling for real-time GPU tasks. Our approaches exert control over GPU context scheduling at the device driver level and enable preemptive GPU scheduling based on task priorities. The kernel thread-based approach achieves this without requiring modifications to user-level programs, while the IOCTL-based approach needs only a single macro at the boundaries of GPU access segments. In addition, we provide a comprehensive response time analysis that takes into account overlaps between different task segments, mitigating pessimism in worst-case estimates. Through empirical evaluations and case studies, we demonstrate the effectiveness of the proposed approaches in improving taskset schedulability and timeliness of real-time tasks. The results highlight significant improvements over prior work, with up to 40% higher schedulability, while also achieving predictable worst-case behavior on Nvidia Jetson embedded platforms.
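As a rough illustration of how an Audsley-style priority assignment interacts with a response-time test, the following Python sketch assigns priorities from lowest to highest and checks each candidate with a schedulability test. It is not the analysis from this paper: the test used here is the textbook response-time recurrence for independent, preemptive, fixed-priority tasks on a single processor, and it ignores GPU segments, blocking, and CPU/GPU overlap entirely. The function names, the `(C, T, D)` task tuples, and the example task set are all assumptions made for illustration.

```python
def response_time(task, higher):
    """Textbook fixed-priority recurrence: R = C + sum(ceil(R / T_h) * C_h).

    task and each entry of higher are (C, T, D) tuples of integers:
    worst-case execution time, period, and relative deadline.
    Returns the response time, or None if it exceeds the deadline.
    """
    c, _, d = task
    r = c
    while True:
        # -(-r // t) is integer ceiling division.
        nxt = c + sum(-(-r // th) * ch for (ch, th, _) in higher)
        if nxt > d:
            return None          # deadline miss at this priority level
        if nxt == r:
            return r             # fixed point reached
        r = nxt


def audsley(tasks):
    """Audsley-style assignment: fill priority levels from lowest to highest.

    tasks maps a task name to its (C, T, D) tuple.
    Returns names ordered lowest priority first, or None if no task
    can be placed at some level (no feasible assignment under this test).
    """
    unassigned = dict(tasks)
    order = []
    while unassigned:
        for name, task in list(unassigned.items()):
            # All other unassigned tasks are assumed to have higher priority.
            higher = [t for n, t in unassigned.items() if n != name]
            if response_time(task, higher) is not None:
                order.append(name)
                del unassigned[name]
                break
        else:
            return None
    return order


# Hypothetical task set: name -> (C, T, D)
tasks = {"a": (1, 4, 4), "b": (2, 6, 6), "c": (3, 12, 12)}
print(audsley(tasks))
```

The loop only relies on the test being evaluable for one task given the set of higher-priority tasks, so a different per-task test could in principle be substituted for `response_time`, provided it meets the conditions under which Audsley's algorithm is valid.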
