LithOS: An Operating System for Efficient Machine Learning on GPUs
Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, Dimitrios Skarlatos
TL;DR
LithOS introduces an OS-style layer for GPUs that transparently interposes in the CUDA driver to provide fine-grained per-TPC scheduling, kernel atomization, dynamic right-sizing, and DVFS. It presents four core mechanisms (TPC Scheduler, Kernel Atomizer, Right-Sizing, and Transparent Power Management) combined with an online latency predictor to maximize GPU utilization while preserving isolation for multitenant ML workloads. Experimental results show up to 13x tail-latency reduction relative to MPS for inference stacking and 4.7x for hybrid inference-training stacking, with substantial throughput gains and meaningful energy savings (up to 46% in DVFS) under modest performance overhead. The approach demonstrates significant improvements over MPS and state-of-the-art (SotA) transparent schemes, highlighting LithOS as a foundational step toward an operating system for GPUs that can scale with datacenter ML demands.
Abstract
The surging demand for GPUs in datacenters for machine learning (ML) has made efficient GPU utilization crucial. However, meeting the diverse needs of ML models while optimizing resource usage is challenging. To enable transparent, fine-grained GPU management that maximizes utilization and energy efficiency while maintaining strong isolation, an operating system (OS) approach is needed. This paper introduces LithOS, a first step toward a GPU OS. LithOS includes the following new abstractions and mechanisms for efficient GPU resource management: (i) a novel TPC Scheduler that supports spatial scheduling at the granularity of individual TPCs, unlocking efficient TPC stealing between workloads; (ii) transparent kernel atomization to reduce head-of-line blocking and enable dynamic resource reallocation mid-execution; (iii) a lightweight hardware right-sizing mechanism that determines the minimal TPC resources needed per atom; and (iv) a transparent power management mechanism that reduces power consumption based on in-flight work behavior. We implement LithOS in Rust and evaluate its performance across extensive ML environments, comparing it to state-of-the-art (SotA) solutions from NVIDIA and prior research. For inference stacking, LithOS reduces tail latencies by 13x compared to MPS; compared to the best SotA, it reduces tail latencies by 3x while improving aggregate throughput by 1.6x. In hybrid inference-training stacking, LithOS reduces tail latencies by 4.7x compared to MPS; compared to the best SotA, it reduces tail latencies by 1.18x while improving aggregate throughput by 1.35x. Finally, for a modest performance hit of under 4%, LithOS's right-sizing saves a quarter of GPU capacity on average, while for a 7% hit, its power management saves a quarter of a GPU's energy. Overall, LithOS increases GPU efficiency, establishing a foundation for future OS research on GPUs.
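To make the idea of kernel atomization in (ii) concrete, here is a minimal Python sketch. It is not LithOS's Rust implementation. It assumes a kernel launch can be modeled as a one-dimensional grid of thread blocks and that an atom is a contiguous range of those blocks; the names `Atom`, `atomize`, and `max_blocks_per_atom` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Atom:
    """A contiguous range of thread-block indices (hypothetical type)."""
    first_block: int
    num_blocks: int


def atomize(grid_blocks: int, max_blocks_per_atom: int) -> List[Atom]:
    """Split a grid of `grid_blocks` thread blocks into atoms.

    Each atom covers at most `max_blocks_per_atom` blocks. The atoms are
    disjoint and together cover every block exactly once.
    """
    if grid_blocks < 0:
        raise ValueError("grid_blocks must be non-negative")
    if max_blocks_per_atom <= 0:
        raise ValueError("max_blocks_per_atom must be positive")

    atoms = []
    for start in range(0, grid_blocks, max_blocks_per_atom):
        # The last atom may be shorter than the limit.
        size = min(max_blocks_per_atom, grid_blocks - start)
        atoms.append(Atom(first_block=start, num_blocks=size))
    return atoms


if __name__ == "__main__":
    for atom in atomize(grid_blocks=10, max_blocks_per_atom=4):
        print(atom)
```

In this simplified model, a 10-block grid with a limit of 4 yields atoms of 4, 4, and 2 blocks. A scheduler could then revisit the TPC allocation at each atom boundary instead of waiting for the whole kernel to finish, which is the intuition behind reduced head-of-line blocking.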
