gpu_ext: Extensible OS Policies for GPUs via eBPF

Yusheng Zheng, Tong Yu, Yiwei Yang, Minghui Jiang, Xiangyu Gao, Jianchang Su, Yanpeng Hu, Wenan Mao, Wei Zhang, Dan Williams, Andi Quinn

Abstract

Performance in modern GPU-centric systems increasingly depends on resource management policies, including memory placement, scheduling, and observability. However, uniform policies typically yield suboptimal performance across diverse workloads. Existing approaches present a tradeoff: user-space runtimes provide programmability and flexibility but lack cross-tenant visibility and fine-grained control of hardware resources, while modifications to the OS kernel introduce significant complexity and safety risks. To address this, we argue that the GPU driver and device layer should provide an extensible OS interface for policy enforcement. While the emerging eBPF technology shows potential, directly applying existing host-side eBPF is insufficient because it lacks visibility into and control over critical device-side events, and directly embedding policy code into GPU kernels could compromise safety and efficiency. We propose gpu_ext, an eBPF-based runtime that treats the GPU driver and device as a programmable OS subsystem. gpu_ext extends GPU drivers by exposing safe programmable hooks and introduces a device-side eBPF runtime capable of executing verified policy logic within GPU kernels, enabling coherent and transparent policies. Evaluation across realistic workloads including inference, training, and vector search demonstrates that gpu_ext improves throughput by up to 4.8x and reduces tail latency by up to 2x, with low overhead and without modifying or restarting applications.

Paper Structure

This paper contains 60 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Memory access (page fault) patterns vary across GPU workloads: faiss Build exhibits sequential scans; faiss Query exhibits random accesses; llama.cpp MoE Prefill shows periodic sequential patterns while Decode exhibits sparse random accesses; PyTorch DNN shows periodic block accesses.
  • Figure 2: GPU thread scheduling imbalance observed via eBPF tracing. (a) SM load distribution shows 127$\times$ imbalance (SM 15: 382 threads vs SM 6: 3 threads). (b) Warp activity heatmap reveals SM 0 concentrates work in high-numbered warps while other SMs underutilize warp slots.
  • Figure 3: gpu_ext architecture: cross-layer eBPF runtime spanning kernel driver and GPU device. The control plane deploys policies via (a) the bpf syscall to the verifier and driver hooks; (b) GPU JIT after verification; and (c) DBI, which injects trampolines into GPU kernels. Shared maps enable state coordination across layers.
  • Figure 4: gpu_ext block-scheduling policies across workload regimes. No single policy dominates: always-steal works well under moderate imbalance (a) but becomes pathological under clustered heavy tails (b), where LatencyBudget matches baseline performance.
  • Figure 5: Prefill (pp512) and decode (tg128) throughput for GPT-OSS-120B MoE (59 GiB) on RTX 5090 (32GB). gpu_ext eBPF prefetching achieves 4.8$\times$ speedup on decode (memory-bound) over framework offloading while maintaining competitive prefill performance.
  • ...and 7 more figures