Towards CXL Resilience to CPU Failures

Antonis Psistakis; Burak Ocalan; Chloe Alverti; Fabien Chaix; Ramnatthan Alagappan; Josep Torrellas

TL;DR

This work addresses resilience in CXL-based distributed shared memory by extending the CXL specification to tolerate Compute Node (CN) failures. It introduces ReCXL, which augments remote writes with replication to a small set of replica CNs, logs the updates in hardware Logging Units, and periodically dumps the logs to Memory Nodes (MNs) for recovery. A software-driven recovery protocol reconstructs directory and memory state from the logs to restore consistent execution. Evaluation on a cluster of 16 CNs and 16 MNs shows that ReCXL enables fault-tolerant execution with about a 30% slowdown relative to the same platform without fault-tolerance support, which suggests that fault-tolerant CXL clusters are practical.

Abstract

Compute Express Link (CXL) 3.0 and beyond allows the compute nodes of a cluster to share data with hardware cache coherence and at the granularity of a cache line. This enables shared-memory semantics for distributed computing, but introduces new resilience challenges: a node failure leads to the loss of the dirty data in its caches, corrupting application state. Unfortunately, the CXL specification does not consider processor failures. Moreover, when a component fails, the specification tries to isolate it and continue application execution; there is no attempt to bring the application to a consistent state. To address these limitations, this paper extends the CXL specification to be resilient to node failures, and to correctly recover the application after node failures. We call the system ReCXL. To handle the failure of nodes, ReCXL augments the coherence transaction of a write with messages that propagate the update to a small set of other nodes (i.e., Replicas). Replicas save the update in a hardware Logging Unit. Such replication ensures resilience to node failures. Then, at regular intervals, the Logging Units dump the updates to memory. Recovery involves using the logs in the Logging Units to bring the directory and memory to a correct state. Our evaluation shows that ReCXL enables fault-tolerant execution with only a 30% slowdown over the same platform with no fault-tolerance support.
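
The replicate, log, dump, and recover flow described above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the ReCXL hardware protocol: the names `ToyCluster` and `LoggingUnit`, the ring-order choice of replicas, the global sequence counter, and the flat `memory` dictionary are all hypothetical simplifications, and cache coherence and directory state are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class LoggingUnit:
    """Toy stand-in for a replica's hardware Logging Unit (hypothetical)."""
    entries: list = field(default_factory=list)  # (seq, writer, addr, value)

    def log(self, seq, writer, addr, value):
        self.entries.append((seq, writer, addr, value))


class ToyCluster:
    """Software-only model: compute nodes with dirty caches plus shared memory."""

    def __init__(self, num_nodes, num_replicas):
        self.num_nodes = num_nodes
        self.num_replicas = num_replicas
        self.memory = {}                                   # state at the memory nodes
        self.caches = [dict() for _ in range(num_nodes)]   # dirty lines per node
        self.logs = [LoggingUnit() for _ in range(num_nodes)]
        self.alive = [True] * num_nodes
        self.seq = 0                                       # assumed global write order

    def replicas_of(self, writer):
        # Assumption: replicas are the next nodes in ring order.
        return [(writer + i) % self.num_nodes
                for i in range(1, self.num_replicas + 1)]

    def write(self, writer, addr, value):
        # The write stays dirty in the writer's cache and is also
        # propagated to the logging units of the live replicas.
        self.seq += 1
        self.caches[writer][addr] = value
        for r in self.replicas_of(writer):
            if self.alive[r]:
                self.logs[r].log(self.seq, writer, addr, value)

    def dump_logs(self):
        # Periodic dump: apply logged updates to memory in write order,
        # then empty the logging units.
        pending = [e for n, lu in enumerate(self.logs)
                   if self.alive[n] for e in lu.entries]
        for _, _, addr, value in sorted(pending, key=lambda e: e[0]):
            self.memory[addr] = value
        for lu in self.logs:
            lu.entries.clear()

    def fail(self, node):
        # A failed node loses its dirty cache lines and its own log.
        self.alive[node] = False
        self.caches[node].clear()
        self.logs[node].entries.clear()

    def recover(self, failed):
        # Rebuild memory from the surviving replicas' logs of the failed writer.
        lost = [e for n, lu in enumerate(self.logs) if self.alive[n]
                for e in lu.entries if e[1] == failed]
        for _, _, addr, value in sorted(lost, key=lambda e: e[0]):
            self.memory[addr] = value
        return sorted({e[2] for e in lost})
```

In this model, a write by node 0 followed by `fail(0)` leaves memory stale until `recover(0)` replays the entries held by the surviving replicas. With two replicas per write, the update survives even if one replica fails together with the writer.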
