Equilibria: Fair Multi-Tenant CXL Memory Tiering At Scale
Kaiyang Zhao, Neha Gholkar, Hasan Maruf, Abhishek Dhanotia, Johannes Weiner, Gregory Price, Ning Sun, Bhavya Dwivedi, Stuart Clark, Dimitrios Skarlatos
TL;DR
Equilibria addresses fair, multi-tenant memory tiering in CXL-enabled datacenters with an OS-level framework that provides per-container observability, regulated promotion and demotion, and thrashing mitigation. It defines a formal notion of fairness in terms of a lower memory protection and an upper bound, which guarantees each tenant sufficient fast memory while still allowing work-conserving bursts and safe capacity planning. Drawing on lessons from real-world deployment, a comprehensive evaluation on production workloads and benchmarks, and upstream Linux integration, Equilibria improves performance over the state-of-the-art Linux solution, TPP, by up to 52% for production workloads and 1.7x for benchmarks, while preserving SLOs and reducing interference. The result is an upstreamable, scalable solution that makes CXL memory viable for multi-tenant production environments.
Abstract
Memory dominates datacenter system cost and power. Memory expansion via Compute Express Link (CXL) is an effective way to provide additional memory at lower cost and power, but its effective use requires software-level tiering for hyperscaler workloads. Existing tiering solutions, including current Linux support, face fundamental limitations in production deployments. First, they lack multi-tenancy support, failing to handle stacked homogeneous or heterogeneous workloads. Second, limited control-plane flexibility leads to fairness violations and performance variability. Finally, insufficient observability prevents operators from diagnosing performance pathologies at scale. We present Equilibria, an OS framework enabling fair, multi-tenant CXL tiering at datacenter scale. Equilibria provides per-container controls for memory fair-share allocation and fine-grained observability of tiered-memory usage and operations. It further enforces flexible, user-specified fairness policies through regulated promotion and demotion, and mitigates noisy-neighbor interference by suppressing thrashing. Evaluated in a large hyperscaler fleet using production workloads and benchmarks, Equilibria helps workloads meet service level objectives (SLOs) while avoiding performance interference. It improves performance over the state-of-the-art Linux solution, TPP, by up to 52% for production workloads and 1.7x for benchmarks. All Equilibria patches have been released to the Linux community.
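To make the fair-share notion above more concrete (a protected lower amount of fast memory per tenant, an upper bound, and work-conserving use of spare capacity), here is a minimal illustrative sketch. It is not Equilibria's implementation: the `Tenant` fields, the `allocate_fast_tier` function, the page-count units, and the even-split policy for spare capacity are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    # All fields are hypothetical names used only for this sketch.
    name: str
    demand: int     # fast-tier pages the tenant would use if unconstrained
    protected: int  # lower protection: pages guaranteed when demanded
    upper: int      # upper bound on the tenant's fast-tier pages


def allocate_fast_tier(tenants, capacity):
    """Split `capacity` fast-tier pages among tenants; returns {name: pages}."""
    # Step 1: honor each tenant's protection, but never beyond its demand.
    alloc = {t.name: min(t.demand, t.protected) for t in tenants}
    if sum(alloc.values()) > capacity:
        raise ValueError("protections oversubscribe the fast tier")
    remaining = capacity - sum(alloc.values())

    # Step 2 (work-conserving): share the spare pages among tenants that
    # still want more, never exceeding a tenant's upper bound or demand.
    want = {t.name: max(0, min(t.demand, t.upper) - alloc[t.name])
            for t in tenants}
    active = [name for name, w in want.items() if w > 0]
    while remaining > 0 and active:
        share = max(1, remaining // len(active))
        for name in list(active):
            grant = min(share, want[name], remaining)
            alloc[name] += grant
            want[name] -= grant
            remaining -= grant
            if want[name] == 0:
                active.remove(name)
            if remaining == 0:
                break
    return alloc


if __name__ == "__main__":
    tenants = [
        Tenant("a", demand=80, protected=40, upper=100),
        Tenant("b", demand=20, protected=40, upper=100),
    ]
    # Tenant "b" uses less than its protection, so "a" may burst into the slack.
    print(allocate_fast_tier(tenants, capacity=100))
```

In this toy model, a tenant that stays under its protection never loses fast memory to a neighbor, while unused protection is lent out rather than left idle. A real system would enforce such limits incrementally through promotion and demotion decisions rather than a one-shot split.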
