The Case for Replication-Aware Memory-Error Protection in Disaggregated Memory

Haris Volos

TL;DR

The paper tackles the rising memory-error challenges in disaggregated memory built on high-density NVM by presenting Replication-Aware Memory-error Protection (RAMP), which co-designs application-level data replication with hardware-level memory protection to reduce per-replica storage overhead. It introduces an analytical model that quantifies the trade-offs between protection strength, reliability, and performance when leveraging replication across multiple memory nodes. Empirical results on chipkill-based designs show that per-replica protection can be weakened (reducing storage overhead) while achieving the same overall protection through replication, e.g., lowering overhead from $27\%$ to $17.7\%$ and only requiring a marginal extra cost to meet tight SDC targets like $10^{-22}$. The work demonstrates a practical path to cost-effective, scalable memory protection for rack-scale NVM by exploiting the replication already used for availability and performance in data-centric applications.

Abstract

Disaggregated memory leverages recent technology advances in high-density, byte-addressable non-volatile memory and high-performance interconnects to provide a large memory pool shared across multiple compute nodes. Due to higher memory density, memory errors may become more frequent. Unfortunately, tolerating memory errors through existing memory-error protection techniques becomes impractical due to increasing storage cost. This work proposes replication-aware memory-error protection to improve storage efficiency of protection in data-centric applications that already rely on memory replication for performance and availability. It lets such applications lower protection storage cost by weakening the protection of each individual replica, but still realize a strong protection target by relying on the collective protection conferred by multiple replicas.
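The idea of leaning on the collective protection of several replicas can be illustrated with a minimal sketch. It is not the paper's analytical model: it assumes that detected uncorrectable errors (DUEs) strike replicas independently, and the names `combined_due_probability`, `read_with_fallback`, and `DetectedUncorrectableError` are hypothetical, introduced only for illustration. Under that assumption, data is lost only if every replica reports a DUE for the same data, so a reader can fall back to another replica whenever one fails.

```python
import math


class DetectedUncorrectableError(Exception):
    """Hypothetical signal: a replica detected an error it could not correct (a DUE)."""


def combined_due_probability(per_replica_due: float, replicas: int) -> float:
    """Probability that all replicas report a DUE for the same data.

    Assumes errors in different replicas are independent; this is a
    simplification for illustration, not the paper's model.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    if not 0.0 <= per_replica_due <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return per_replica_due ** replicas


def read_with_fallback(replica_readers, address):
    """Try each replica in turn; return the first read that succeeds.

    `replica_readers` is a sequence of callables taking an address and
    either returning the data or raising DetectedUncorrectableError.
    """
    for read in replica_readers:
        try:
            return read(address)
        except DetectedUncorrectableError:
            continue  # this replica failed; try the next one
    raise DetectedUncorrectableError(f"all replicas failed at address {address}")


if __name__ == "__main__":
    # Illustrative number only, not taken from the paper.
    weak = 1e-6
    print(math.isclose(combined_due_probability(weak, 2), 1e-12, rel_tol=1e-9))
```

Note that this fallback only helps with errors a replica detects. Errors that escape detection are returned as if they were valid data, so replication alone does not reduce them.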

Paper Structure (10 sections, 6 equations, 3 figures, 1 table)

Introduction
Disaggregated Memory
Enabling technologies and architecture
Memory failures
Replication-Aware Memory-error Protection
Architecture
Tolerating memory errors through replication
Choosing replica protection strength
Replication-Aware Chipkill-Correct
Conclusion

Figures (3)

Figure 1: Rack-scale non-volatile memory (NVM) architectures
Figure 2: RAMP architecture.
Figure 3: DUE, NDE, and storage overhead for different chipkill protection schemes. The solid rectangle (in the top-left panel) marks the DUE and storage overhead of the original chipkill design [zhang:pm-chipkill:micro:2018]. NDE is shown only for baseline chipkill because it is independent of replication and identical for all chipkill schemes.
