Equilibria: Fair Multi-Tenant CXL Memory Tiering At Scale

Kaiyang Zhao, Neha Gholkar, Hasan Maruf, Abhishek Dhanotia, Johannes Weiner, Gregory Price, Ning Sun, Bhavya Dwivedi, Stuart Clark, Dimitrios Skarlatos

TL;DR

Equilibria is an OS-level framework for fair, multi-tenant memory tiering in CXL-enabled datacenters. It provides per-container observability, regulated promotion and demotion, and thrashing mitigation. Fairness is defined formally through a per-tenant lower memory protection and an upper bound, which together guarantee each tenant sufficient fast memory while enabling work-conserving bursts and safe capacity planning. Drawing on lessons from real-world deployment and evaluated on production workloads and benchmarks, Equilibria improves performance over TPP, the state-of-the-art Linux solution, by up to 52% for production workloads and 1.7x for benchmarks, while preserving SLOs and reducing interference. Its patches have been released to the Linux community, offering a practical, scalable way to make CXL memory viable in multi-tenant production environments.

Abstract

Memory dominates datacenter system cost and power. Memory expansion via Compute Express Link (CXL) is an effective way to provide additional memory at lower cost and power, but its effective use requires software-level tiering for hyperscaler workloads. Existing tiering solutions, including current Linux support, face fundamental limitations in production deployments. First, they lack multi-tenancy support, failing to handle stacked homogeneous or heterogeneous workloads. Second, limited control-plane flexibility leads to fairness violations and performance variability. Finally, insufficient observability prevents operators from diagnosing performance pathologies at scale. We present Equilibria, an OS framework enabling fair, multi-tenant CXL tiering at datacenter scale. Equilibria provides per-container controls for memory fair-share allocation and fine-grained observability of tiered-memory usage and operations. It further enforces flexible, user-specified fairness policies through regulated promotion and demotion, and mitigates noisy-neighbor interference by suppressing thrashing. Evaluated in a large hyperscaler fleet using production workloads and benchmarks, Equilibria helps workloads meet service level objectives (SLOs) while avoiding performance interference. It improves performance over the state-of-the-art Linux solution, TPP, by up to 52% for production workloads and 1.7x for benchmarks. All Equilibria patches have been released to the Linux community.
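
To make the fairness notion from the TL;DR concrete, the following is a minimal illustrative sketch, not Equilibria's actual mechanism (which, per the abstract, works by regulating promotion and demotion in the kernel). It assumes a static snapshot in which each container has a fast-memory demand, a protected lower bound, and an upper bound. The names `Tenant`, `low`, `high`, and `allocate_fast_memory` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    demand: int  # fast-memory pages the tenant would use if unconstrained
    low: int     # hypothetical protected lower bound (pages)
    high: int    # hypothetical upper bound (pages)


def allocate_fast_memory(capacity, tenants):
    """Split fast-tier capacity among tenants (illustrative assumption only).

    Phase 1 grants each tenant its demand up to the protected lower bound.
    Phase 2 hands out spare capacity in equal rounds to tenants that still
    want more, never exceeding a tenant's upper bound (work-conserving burst).
    """
    alloc = {t.name: min(t.demand, t.low) for t in tenants}
    if sum(alloc.values()) > capacity:
        raise ValueError("lower protections oversubscribe fast memory")
    spare = capacity - sum(alloc.values())

    # Remaining headroom per tenant, capped by the upper bound.
    want = {t.name: max(0, min(t.demand, t.high) - alloc[t.name]) for t in tenants}
    active = [name for name, w in want.items() if w > 0]

    while spare > 0 and active:
        share = max(spare // len(active), 1)
        for name in list(active):
            grant = min(share, want[name], spare)
            alloc[name] += grant
            want[name] -= grant
            spare -= grant
            if want[name] == 0:
                active.remove(name)
            if spare == 0:
                break
    return alloc


if __name__ == "__main__":
    tenants = [Tenant("A", demand=80, low=30, high=100),
               Tenant("B", demand=40, low=30, high=50)]
    print(allocate_fast_memory(100, tenants))
```

With 100 pages of fast memory, both containers first receive their protected 30 pages; the remaining 40 pages are then shared, so B reaches its demand of 40 and A bursts to 60. If A's demand later drops, its unused share becomes available to B up to B's upper bound.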

Paper Structure (35 sections, 2 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Memory as a % of rack TCO and power at corpX's datacenters.
  • Figure 2: Latency versus bandwidth for local, remote, and CXL memory.
  • Figure 3: Container A with hotter access patterns gets all of its footprint in local memory, whereas Container B gets only about half.
  • Figure 4: Equilibria Overview.
  • Figure 5: Memory usage when containers exceed local memory lower protection.
  • ...and 4 more figures