Pushing the Memory Bandwidth Wall with CXL-enabled Idle I/O Bandwidth Harvesting

Divya Kiran Kadiyala, Alexandros Daglis

TL;DR

The paper tackles the memory bandwidth wall in bandwidth-constrained manycore CPUs by enabling dynamic multiplexing of memory and I/O traffic over CXL. It introduces SURGE, a software-assisted architectural approach that salvages idle I/O bandwidth to expand memory bandwidth. SURGE is realized through two architectural embodiments (Surge Solo and Surge Pod) and is complemented by cluster-manager-driven workload placement and an analytical traffic-split model. The evaluation on a CXL-based, memory-bandwidth-starved setup demonstrates speedups of up to 1.3x under idle I/O and up to 1.2x under higher I/O traffic, with SURGE showing robustness to interference and to variability in workload characteristics. This work offers a practical mechanism to improve memory bandwidth utilization in modern servers, potentially reducing energy and cost while improving performance for memory-intensive workloads.
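The TL;DR mentions an analytical traffic-split model but does not give its form. As a rough illustration only, assume memory traffic is divided between the native DDR channels and the salvaged CXL link in proportion to each path's available bandwidth, so both paths saturate at the same aggregate demand. The function name, its parameters, and the proportional rule are hypothetical stand-ins, not the paper's model, and the sketch ignores the higher access latency of the CXL path.

```python
def cxl_traffic_fraction(ddr_bw_gbps: float, io_link_bw_gbps: float, io_util: float) -> float:
    """Return the fraction of memory traffic to steer over the CXL link.

    Hypothetical proportional-split model (not the paper's): each path gets
    traffic in proportion to its available bandwidth, so both reach full
    utilization at the same total demand. Latency differences are ignored.
    """
    if not 0.0 <= io_util <= 1.0:
        raise ValueError("io_util must be in [0, 1]")
    # Only the idle share of the I/O link can be salvaged for memory traffic.
    idle_io_gbps = io_link_bw_gbps * (1.0 - io_util)
    total_gbps = ddr_bw_gbps + idle_io_gbps
    if total_gbps <= 0.0:
        raise ValueError("no bandwidth available on either path")
    return idle_io_gbps / total_gbps


if __name__ == "__main__":
    # Illustrative numbers only: 300 GB/s of DDR and a 100 GB/s I/O link.
    for util in (0.0, 0.5, 1.0):
        share = cxl_traffic_fraction(300.0, 100.0, util)
        print(f"I/O utilization {util:.1f}: {share:.3f} of memory traffic over CXL")
```

Under this toy rule, the salvaged share shrinks toward zero as I/O utilization rises, which is consistent with the TL;DR reporting smaller speedups under higher I/O traffic.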

Abstract

The continual increase of cores on server-grade CPUs raises demands on memory systems, which are constrained by limited off-chip pin and data transfer rate scalability. As a result, high-end processors typically feature lower memory bandwidth per core, to the detriment of memory-intensive workloads. We propose alleviating this challenge by improving the utility of the CPU's limited pins. In a typical CPU design process, the available pins are apportioned between memory and I/O traffic, each accounting for about half of the total off-chip bandwidth. Consequently, unless both memory and I/O are simultaneously highly utilized, such fragmentation leads to underutilization of the valuable off-chip bandwidth resources. An ideal architecture would offer I/O and memory bandwidth fungibility, allowing use of the aggregate off-chip bandwidth in the form required by each workload. In this work, we introduce SURGE, a software-supported architectural technique that boosts memory bandwidth availability by salvaging idle I/O bandwidth resources. SURGE leverages the capability of versatile interconnect technologies like CXL to dynamically multiplex memory and I/O traffic over the same processor interface. We demonstrate that SURGE-enhanced architectures can accelerate memory-intensive workloads on bandwidth-constrained servers by up to 1.3x.
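The fragmentation argument can be made concrete with a small calculation. The roughly even split between memory and I/O comes from the abstract; the helper name and all bandwidth numbers are hypothetical. The sketch contrasts the memory bandwidth a workload sees under a fixed pin split with the idealized fungible case, in which any idle I/O bandwidth can also carry memory traffic.

```python
def usable_memory_bw(total_offchip_gbps: float, io_demand_gbps: float,
                     mem_share: float = 0.5, fungible: bool = False) -> float:
    """Memory bandwidth available to a workload (idealized, hypothetical model).

    total_offchip_gbps: aggregate off-chip bandwidth of the CPU's pins.
    io_demand_gbps: bandwidth actually consumed by I/O traffic.
    mem_share: fraction of the off-chip bandwidth wired to memory channels.
    fungible: if True, idle I/O bandwidth can also serve memory traffic.
    """
    mem_bw = total_offchip_gbps * mem_share
    io_bw = total_offchip_gbps - mem_bw
    if not fungible:
        return mem_bw  # fixed split: idle I/O bandwidth goes unused
    idle_io = max(io_bw - io_demand_gbps, 0.0)
    return mem_bw + idle_io


if __name__ == "__main__":
    total, io_demand = 800.0, 100.0  # made-up numbers in GB/s
    fixed = usable_memory_bw(total, io_demand)
    ideal = usable_memory_bw(total, io_demand, fungible=True)
    print(f"fixed split: {fixed} GB/s, fungible: {ideal} GB/s, "
          f"bandwidth ceiling gain: {ideal / fixed:.2f}x")
```

The resulting ratio is an idealized bandwidth ceiling, not a predicted speedup: application speedup also depends on how bandwidth-bound the workload is and on the cost of the CXL path, neither of which this model captures.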

Paper Structure

This paper contains 26 sections, 4 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Bandwidth characteristics of modern manycore processors. SKUs sampled from the AMD EPYC and Intel Xeon server processor families.
  • Figure 2: Per-core memory bandwidth demands across evaluated workloads and ranges where two modern CPUs encounter queuing delays.
  • Figure 3: Surge hardware and software components.
  • Figure 4: Two Surge architectural embodiments.
  • Figure 5: Salvage memory utility for Surge as a function of pod size and P (probability of each individual salvage link having sufficient idle bandwidth to be salvaged by Surge).
  • ...and 11 more figures
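Figure 5's caption describes salvage memory utility as a function of pod size and P, but its exact definition is not given here. The sketch below uses a simple stand-in: if each salvage link in a pod independently has sufficient idle bandwidth with probability P, the chance that at least k of the pod's links are salvageable follows a binomial tail. The independence assumption and the function name are hypothetical, not taken from the paper.

```python
from math import comb


def prob_at_least_k_salvageable(pod_size: int, p: float, k: int = 1) -> float:
    """P(at least k of pod_size links have enough idle bandwidth).

    Hypothetical model: links are independent, each salvageable with prob. p.
    """
    if pod_size < 0 or not 0.0 <= p <= 1.0:
        raise ValueError("pod_size must be >= 0 and p must be in [0, 1]")
    if k <= 0:
        return 1.0
    if k > pod_size:
        return 0.0
    # Binomial tail: sum over i = k..n of C(n, i) * p^i * (1 - p)^(n - i).
    return sum(comb(pod_size, i) * p**i * (1.0 - p) ** (pod_size - i)
               for i in range(k, pod_size + 1))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"pod size {n}: P(>=1 salvageable link) = "
              f"{prob_at_least_k_salvageable(n, 0.5):.3f}")
```

Under this toy model, larger pods make it more likely that some link has idle bandwidth to lend, which is one plausible reading of why pod size appears as an axis in Figure 5.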