Managed-Retention Memory: A New Class of Memory for the AI Era

Sergey Legtchenko, Ioan Stefanovici, Richard Black, Antony Rowstron, Junyi Liu, Paolo Costa, Burcu Canakci, Dushyanth Narayanan, Xingbo Wu

TL;DR

The paper identifies a mismatch between AI foundation-model inference workloads and current HBM capabilities, emphasizing extreme read-dominated, sequential memory access with very large weights and KV caches. It proposes Managed-Retention Memory (MRM), a new memory class that relaxes long-term data retention to days or hours to achieve higher endurance, density, and read throughput, leveraging non-volatile memory technologies originally designed for SCM. The authors argue that by co-designing memory cells, controllers, and software (including retention-aware data placement, dynamic retention, and lightweight controllers), MRM can become a practical component alongside HBM in AI clusters. This cross-layer vision could reduce cost and energy per inference and unlock new hardware-software co-optimization opportunities for AI workloads.

Abstract

AI clusters are today one of the major users of High Bandwidth Memory (HBM). However, HBM is suboptimal for AI workloads for several reasons. Analysis shows HBM is overprovisioned on write performance, but underprovisioned on density and read bandwidth, and also has significant energy-per-bit overheads. It is also expensive, with lower yield than DRAM due to manufacturing complexity. We propose a new memory class: Managed-Retention Memory (MRM), which is better optimized to store key data structures for AI inference workloads. We believe that MRM may finally provide a path to viability for technologies that were originally proposed to support Storage Class Memory (SCM). These technologies traditionally offered long-term persistence (10+ years) but provided poor IO performance and/or endurance. MRM makes different trade-offs: by understanding the workload IO patterns, MRM forgoes long-term data retention and write performance for better potential performance on the metrics important for these workloads.

Paper Structure (8 sections, 1 figure)

This paper contains 8 sections and 1 figure.

Figures (1)

  • Figure 1: Endurance requirements for KV cache and model weights vs. endurance of memory technologies.