Towards Disaggregation-Native Data Streaming between Devices

Nils Asmussen, Michael Roitzsch

TL;DR

The paper tackles data-movement bottlenecks in disaggregated datacenters enabled by fabrics such as CXL, where CPU-centric staging can negate potential latency benefits. It proposes disaggregation-native devices carrying a device-independent data streaming facility and analyzes three protocol-placement strategies, advocating distributed resource-side protocols to minimize hops. Using the M3 architecture with DTU-enabled tiles, it outlines architectural components, device heterogeneity considerations, access control, and protocol implementation approaches, and discusses cross-machine extensions as a core challenge. A gem5-based evaluation demonstrates substantial latency improvements for distributed protocols (up to ~67% faster than app-side and ~25% faster than central), while acknowledging simulation limitations and open questions about mapping security primitives onto CXL fabrics for robust isolation.

Abstract

Disaggregation is an ongoing trend to increase flexibility in datacenters. With interconnect technologies like CXL, pools of CPUs, accelerators, and memory can be connected via a datacenter fabric. Applications can then pick from those pools the resources necessary for their specific workload. However, this vision becomes less clear when we consider data movement. Workloads often require data to be streamed through chains of multiple devices, but these data streams typically do not flow physically from device to device; instead, they are staged in memory by a CPU hosting device protocol logic. We show that augmenting devices with a disaggregation-native and device-independent data streaming facility can improve processing latencies by enabling data flows directly between arbitrary devices.
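
To make the staging overhead concrete, the following is a minimal sketch that counts data transfers for a chain of devices. It is a toy model, not the paper's evaluation: the device names, the function names, and the assumption that every staged link costs exactly two transfers (device to CPU-side memory, then memory to the next device) are illustrative only.

```python
def staged_transfers(chain):
    """Transfers when every link is staged in CPU-side memory.

    Assumption (not from the paper): each device-to-device link becomes
    two transfers, device -> memory and memory -> next device.
    """
    links = max(len(chain) - 1, 0)
    return 2 * links


def direct_transfers(chain):
    """Transfers when data flows directly from each device to the next."""
    return max(len(chain) - 1, 0)


# Hypothetical device chain, chosen only for illustration.
chain = ["nic", "decompressor", "gpu", "ssd"]
print(staged_transfers(chain), direct_transfers(chain))
```

Under these assumptions, direct device-to-device streaming halves the number of transfers for any chain with at least two devices. Real latency also depends on fabric topology, transfer sizes, and protocol overhead, which this sketch ignores.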

Paper Structure (21 sections, 3 figures)

Figures (3)

  • Figure 1: Different options for executing a protocol between application and devices: application-side protocol (left), central resource-side protocol (center), and distributed resource-side protocol (right). The dashed lines indicate machine boundaries.
  • Figure 2: System architecture of M3: one DTU per tile isolates tiles from each other and selectively allows communication as configured by the M3 kernel. TileMux multiplexes its tile among the applications on this tile.
  • Figure 3: Performance comparison of different protocol placements using different data sizes.