Analyzing a Two-Tier Disaggregated Memory Protection Scheme Based on Memory Replication

Haris Volos, Yiannakis Sazeides

TL;DR

RAMP addresses the rising unreliability of memory in dense, scalable systems with a two-tier resilience framework that couples a memory-protection tier with a memory-replication tier. It provides an analytic model to quantify reliability and performance and applies it to a concrete disaggregated-memory design, RAMP-DM, built on Hydra. The evaluation shows that weakening per-replica protection reduces the storage overhead of memory protection from 27% to 17.7%, while replication maintains overall protection with minimal performance impact. This work offers a practical path to cost-effective, high-availability memory in disaggregated data-center architectures.

Abstract

As memory technologies continue to shrink and memory error rates increase, the demand for stronger reliability becomes increasingly critical. Fine-grain memory replication has emerged as an appealing approach to improving memory fault tolerance by augmenting conventional memory protection based on error-correcting codes with an additional layer of redundancy that replicates data across independent failure domains, such as replicating memory pages across different NUMA sockets. This method can tolerate a broad spectrum of memory errors, from individual memory cell failures to more complex memory controller failures. However, applying memory replication without a holistic consideration of the interaction between error-correcting codes and replication can result in redundant duplication and unnecessary storage overhead. We propose Replication-Aware Memory-error Protection (RAMP), a model that helps explore error protection strategies to improve the storage efficiency of memory protection in memory systems that utilize memory replication for performance and availability. We use RAMP to determine a protection strategy that can lower the storage cost of individual replicas while still ensuring robust protection through the collective protection conferred by multiple replicas. Our evaluation shows that a solution derived with RAMP enhances the storage efficiency of a state-of-the-art memory protection mechanism when paired with rack-level replication for disaggregated memory. Specifically, we can reduce the storage cost of memory protection from 27% down to 17.7% with minimal performance overhead.
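To put the headline numbers in perspective, the short sketch below works out the absolute and relative savings implied by going from a 27% to a 17.7% storage cost. The function name `overhead_savings` is hypothetical and only illustrates the arithmetic; it is not part of RAMP or its model.

```python
def overhead_savings(baseline_pct: float, reduced_pct: float) -> tuple[float, float]:
    """Return (absolute saving in percentage points, relative saving as a fraction).

    Illustrative helper only: the two inputs are the storage-cost figures
    quoted in the abstract, and nothing here models reliability.
    """
    absolute = baseline_pct - reduced_pct
    relative = absolute / baseline_pct
    return absolute, relative


absolute, relative = overhead_savings(27.0, 17.7)
print(f"{absolute:.1f} percentage points, {relative:.1%} relative reduction")
```

In other words, the reported reduction is about 9.3 percentage points, or roughly a third of the baseline protection overhead.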

Paper Structure

This paper contains 21 sections, 10 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Memory system architectures.
  • Figure 2: Replication-Aware Memory-error Protection.
  • Figure 3: RAMP-DM system architecture.
  • Figure 4: Memcached throughput as completed queries per second (QPS) for different DUE fault rates.
  • Figure 5: Memcached average and P99 response latency for different DUE fault rates.
  • ...and 3 more figures