Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

Anastasiia Ruzhanskaia; Pengcheng Xu; David Cock; Timothy Roscoe

TL;DR

This paper investigates latency-critical, fine-grained communication between CPUs and devices. It demonstrates that when a device participates in the coherence protocol of a cache-coherent interconnect, programmed I/O achieves median latencies of around $0.9$–$1.6\,\mu\text{s}$ with throughput competitive with descriptor-based DMA. The authors implement a family of coherence-based messaging protocols on the Enzian hardware platform and evaluate them on accelerator invocation, NIC-like data movement, and Timely Dataflow offload. The results show clear latency and tail-latency advantages for small messages and substantial improvements for certain workloads such as Bloom filters. The work also outlines a general design space for coherence-based CPU–device collaboration and discusses how the approach generalizes to standards such as CCIX, TileLink, and especially CXL.mem 3.0, offering practical guidance for deploying low-latency I/O in data-center workloads.

Abstract

Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should use Direct Memory Access (DMA) to offload data transfer, descriptor rings for buffering and queuing, and interrupts for asynchrony between cores and devices. In this paper we question this wisdom in the light of two trends: modern and emerging cache-coherent interconnects like CXL 3.0, and fine-grained workloads, particularly microservices and serverless computing. Like some others before us, we argue that the assumptions of the DMA-based model are obsolete, and in many use cases programmed I/O, where the CPU explicitly transfers data and control information to and from a device via loads and stores, delivers a more efficient system. However, we push this idea much further. We show, in a real hardware implementation, the gains in latency for fine-grained communication achievable using an open cache-coherence protocol which exposes cache transitions to a smart device, and that throughput is competitive with DMA over modern interconnects. We also demonstrate three use cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions. In each, we compare our use of coherence with both traditional DMA-style interaction and a highly optimized implementation using memory-mapped programmed I/O over PCIe.

Paper Structure (32 sections, 12 figures, 1 table)

Introduction
Background and Motivation
Interconnects and devices:
Fine-grained workloads:
Experimental platform
DMA performance over PCIe:
PIO performance over PCIe
PIO over a coherent interconnect
Fast CPU--CPU message passing
The implications of message-level access
CPU-device message passing with coherence
Returning a line in Exclusive
Handling larger messages
Handling timeouts
Avoiding deadlocks
...and 17 more sections

Figures (12)

Figure 1: PCIe XDMA invocation latency comparison.
Figure 2: PCIe PIO invocation latency comparison.
Figure 3: Invoking with & ; error handling omitted.
Figure 4: FastForward-style coherent messaging
Figure 5: Protocol variants for efficient CPU-device messaging
...and 7 more figures
