A Performance Verification Methodology for Resource Allocation Heuristics
Saksham Goel, Benjamin Mikek, Jehad Aly, Venkat Arun, Ahmed Saeed, Aditya Akella
TL;DR
This paper addresses the problem of rigorously evaluating resource-allocation heuristics under realistic assumptions. It introduces Virelay, a general SMT-based framework in which designers model a heuristic and its surrounding system as a simple state-transition structure, then analyze worst-case performance subject to explicitly stated assumptions. Across six case studies, the authors derive performance bounds for work stealing and SRPT scheduling, capture existing models for congestion control and packet scheduling, and identify two bugs in the Linux CFS load balancer. The results indicate that formal performance verification can expose both theoretical limits and practical defects across a wide range of resource allocation heuristics.
Abstract
Performance verification is a nascent but promising tool for understanding the performance and limitations of heuristics under realistic assumptions. Bespoke performance verification tools have already demonstrated their value in settings like congestion control and packet scheduling. In this paper, we aim to emphasize the broad applicability and utility of performance verification. To that end, we highlight the design principles of performance verification. Then, we leverage that understanding to develop a set of easy-to-follow guidelines that are applicable to a wide range of resource allocation heuristics. In particular, we introduce Virelay, a framework that enables heuristic designers to express the behavior of their algorithms and their assumptions about the system in an environment that resembles a discrete-event simulator. We demonstrate the utility and ease-of-use of Virelay by applying it to six diverse case studies. We produce bounds on the performance of classical algorithms, work stealing and SRPT scheduling, under practical assumptions. We demonstrate Virelay's expressiveness by capturing existing models for congestion control and packet scheduling, and we verify the observation that TCP unfairness can cause some ML training workloads to spontaneously converge to a state of high network utilization. Finally, we use Virelay to identify two bugs in the Linux CFS load balancer.
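To make the idea of worst-case analysis under stated assumptions concrete, the following is a minimal sketch and not Virelay's actual interface. It models a toy two-core system as a step function, in the style of a discrete-event simulator, and searches every arrival trace allowed by an assumption for the one that maximizes a performance metric. Everything here is hypothetical and chosen for illustration: the names `step`, `worst_case`, `BURST`, and `ARRIVALS`, the burst assumption, the periodic balancing rule, and the "wasted core" metric. Exhaustive enumeration over a short horizon stands in for the SMT solver query that a real performance verifier would issue.

```python
from itertools import product

# Assumption about the environment (hypothetical): at most BURST tasks
# arrive per time step, split arbitrarily across the two cores.
BURST = 2
ARRIVALS = [(a0, a1)
            for a0 in range(BURST + 1)
            for a1 in range(BURST + 1)
            if a0 + a1 <= BURST]


def step(queues, arrivals, run_balancer):
    """Advance the toy system by one step; return (new_queues, wasted)."""
    q = [queues[0] + arrivals[0], queues[1] + arrivals[1]]
    if run_balancer:
        # Toy heuristic: move one task from the longer queue to the
        # shorter one when the lengths differ by at least two.
        hi, lo = (0, 1) if q[0] >= q[1] else (1, 0)
        if q[hi] - q[lo] >= 2:
            q[hi] -= 1
            q[lo] += 1
    # Metric: a core is "wasted" if it idles while the other core holds
    # at least two tasks, so one of them could have run here instead.
    wasted = sum(1 for i in (0, 1) if q[i] == 0 and q[1 - i] >= 2)
    # Each core serves at most one task per step.
    return (max(0, q[0] - 1), max(0, q[1] - 1)), wasted


def worst_case(horizon, period):
    """Search all allowed traces; the balancer runs when t % period == 0."""
    worst, witness = -1, None
    for trace in product(ARRIVALS, repeat=horizon):
        queues, total = (0, 0), 0
        for t, arrivals in enumerate(trace):
            queues, wasted = step(queues, arrivals, t % period == 0)
            total += wasted
        if total > worst:
            worst, witness = total, trace
    return worst, witness


if __name__ == "__main__":
    for period in (1, 2):
        bound, trace = worst_case(horizon=4, period=period)
        print(f"period={period}: worst-case wasted core-steps={bound}, trace={trace}")
```

Calling `worst_case(4, 1)` and `worst_case(4, 2)` contrasts a balancer that runs every step with one that runs every other step. The returned trace acts as a counterexample: a concrete arrival pattern that attains the worst case. Enumeration grows exponentially with the horizon, which is one reason a solver-backed approach is preferable beyond toy sizes.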
