Octopus: Scalable Low-Cost CXL Memory Pooling

Daniel S. Berger; Yuhong Zhong; Fiodar Kazhamiaka; Pantea Zardoshti; Shuwei Teng; Mark D. Hill; Rodrigo Fonseca

TL;DR

This work tackles the high cost and scalability limits of memory pooling in data-center pods by rethinking CXL pod topology. It introduces Octopus, a class of minimally-connected pod designs that rely on small-port pooling devices and combinatorial design techniques (BIBDs) to connect each host to a bounded set of PDs while ensuring pairwise memory sharing among hosts. The paper formalizes memory provisioning guarantees, presents an algorithmic memory-allocation strategy with a provable bound on extra memory requirements, and evaluates the approach through production-trace simulations and hardware experiments. Results show that Octopus achieves memory savings comparable to large, expensive pooling designs and delivers up to 3× lower RPC latency than RDMA, while significantly reducing PD costs and enabling larger pod sizes for memory pooling in practice.
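As a rough illustration of why a BIBD bounds the pod parameters, the standard counting identities for block designs can be written in pod terms. This is a sketch under stated assumptions, not the paper's derivation: it assumes every host pair shares exactly one PD (the summary only requires at least one), it uses $H$ for hosts, $X$ for PD links per host, and $N$ for ports per PD, and it introduces $P$ as a hypothetical symbol for the number of PDs.

$$
\underbrace{P \cdot N = H \cdot X}_{\text{host-to-PD links counted from both ends}}
\qquad
\underbrace{X\,(N-1) = H - 1}_{\text{each host reaches every other host exactly once}}
$$

$$
\Rightarrow\quad H = X\,(N-1) + 1, \qquad P = \frac{H \cdot X}{N}
$$

For example, $N = 4$ and $X = 4$ give $H = 13$ and $P = 13$, while $N = 4$ and $X = 8$ give $H = 25$ and $P = 50$. These identities are necessary conditions only; whether a design with given parameters exists is a separate combinatorial question.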

Abstract

Compute Express Link (CXL) enables compute "pods" with memory pooling across hosts to reduce cost and improve efficiency. Existing pods are small, use exotic many-ported pooling devices, or require indirection through expensive switches. These conventional designs implicitly assume that pods must fully connect all hosts to all CXL pooling devices. This paper breaks with this conventional wisdom to create "Octopus" pods. Octopus connects each host to a bounded number of pooling devices (e.g., 8), each pooling device connects to different subsets of hosts, and all host pairs share at least one pooling device. Despite no longer having a global memory pool, we show that Octopus pods still effectively support memory pooling, as well as various communication patterns. Relative to conventional pods, Octopus is more cost-effective (using near-commodity pooling devices) and enables larger pods (allowing more pooling flexibility and greater communication reach). Simulations on production traces show Octopus achieves memory savings comparable to expensive pool designs. Hardware experiments confirm that Octopus reduces RPC latency by 3x compared to RDMA. Our work formalizes Octopus topologies, develops memory allocation algorithms, and evaluates performance tradeoffs through simulation and hardware testing.
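The connectivity property described above (bounded links per host, small PDs, and every host pair sharing a PD) can be checked on a small example. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical 13-host pod built from the cyclic difference set {0, 1, 3, 9} modulo 13, which is one classical way to obtain a BIBD with 4-port PDs. The function names are made up for this example.

```python
from itertools import combinations


def cyclic_pod(num_hosts, base_block):
    """Build one host set per pooling device (PD) by shifting base_block mod num_hosts."""
    return [
        frozenset((h + shift) % num_hosts for h in base_block)
        for shift in range(num_hosts)
    ]


def shared_pd_counts(num_hosts, pds):
    """Map each host pair (a, b) with a < b to the number of PDs both attach to."""
    counts = {pair: 0 for pair in combinations(range(num_hosts), 2)}
    for hosts in pds:
        for pair in combinations(sorted(hosts), 2):
            counts[pair] += 1
    return counts


def links_per_host(num_hosts, pds):
    """Count how many PDs each host is connected to."""
    return [sum(h in hosts for hosts in pds) for h in range(num_hosts)]


# Hypothetical configuration: 13 hosts, 13 PDs, 4 ports per PD.
H = 13
pds = cyclic_pod(H, (0, 1, 3, 9))
counts = shared_pd_counts(H, pds)

print("PDs:", len(pds))
print("ports used per PD:", {len(hosts) for hosts in pds})
print("PD links per host:", set(links_per_host(H, pds)))
print("shared PDs per host pair:", set(counts.values()))
```

In this example every PD attaches to 4 hosts, every host attaches to 4 PDs, and each of the 78 host pairs shares exactly one PD, so any two hosts can exchange data through a common device without a fully connected pod.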

Paper Structure (37 sections, 2 theorems, 9 equations, 17 figures, 5 tables)

Introduction
CXL Background and Use Cases
CXL Overview
CXL Pod Use Cases
Common Use Case Requirements
CXL Building Blocks and Existing Designs
Pooling Device Costs
Existing CXL Pod Designs
Switched CXL Pods
Fully-connected CXL Pods based on Multi-Ported Devices
Octopus Overview and Foundations
Connectivity
Memory provisioning
Octopus Hardware Design
Logical Network Topology
...and 22 more sections

Key Result

theorem 1

Given a minimally-connected Octopus pod with $H$ hosts, $X$ ports per host, and $N$ ports per PD, and a set of host memory capacity demands $D_1, \dots, D_H$, let $\mu$ be the average CXL memory demand across hosts, $\mu = \frac{1}{H} \sum_{i=1}^{H} D_i$, and let $D_{(1)}, D_{(2)}, \dots, D_{(H)}$ be the h […] when $\alpha$ satisfies the following condition for all $k = 1, \dots, H$:

Figures (17)

Figure 1: Conventional CXL pod designs assume that all 16 hosts ($H_0$ to $H_{15}$) connect to all pooling devices, which requires a still-expensive 16-ported device. Octopus introduces minimally-connected pod designs based on near-commodity 4-port pooling devices, which cuts pooling device cost in half.
Figure 2: Multi-ported device with two CXL ports. Each port offers x8 CXL lanes.
Figure 3: Larger CXL pods lead to higher DRAM savings.
Figure 4: Die area estimates for PDs with different numbers of CXL ports and DDR5 channels. Note that for visual simplicity we show the logic and network-on-chip (NOC) area as a single block.
Figure 5: A 3-rack Octopus configuration.
...and 12 more figures

Theorems & Definitions (5)

theorem 1
definition 1
definition 2
definition 3
lemma 1
