INDIGO: Page Migration for Hardware Memory Disaggregation Across a Network

Archit Patke, Christian Pinto, Saurabh Jha, Haoran Qiu, Zbigniew Kalbarczyk, Ravishankar Iyer

TL;DR

INDIGO tackles the remote-memory performance penalty in hardware memory disaggregation by introducing a network-aware page migration framework that accounts for variable transfer costs under network contention. The approach combines Page Telemetry (burst-duration-aware access-rate estimation) with a Page Promoter controlled by a contextual multi-armed bandit to decide when to migrate pages, based on locality benefits, local memory constraints, and network conditions. Evaluations on a real HMD prototype with cloud and HPC workloads show up to 50-70% improvements in application performance and up to 2x reductions in network traffic, with reasonable training cost and robustness to unseen workloads. These results indicate that telemetry-driven, learning-based page migration can significantly reduce remote-memory penalties and lower the total cost of ownership for memory-disaggregated data centers.

Abstract

Hardware memory disaggregation (HMD) is an emerging technology that enables access to remote memory, thereby creating expansive memory pools and reducing memory underutilization in datacenters. However, a significant challenge arises when accessing remote memory over a network: increased contention that can lead to severe application performance degradation. To reduce the performance penalty of using remote memory, the operating system uses page migration to promote frequently accessed pages closer to the processor. However, previously proposed page migration mechanisms do not achieve the best performance in HMD systems because they are oblivious to the variable page transfer costs caused by network contention. To address these limitations, we present INDIGO: a network-aware page migration framework that uses novel page telemetry and a learning-based approach for network adaptation. We implemented INDIGO in the Linux kernel and evaluated it with common cloud and HPC applications on a real disaggregated memory system prototype. Our evaluation shows that INDIGO offers up to 50-70% improvement in application performance compared to other state-of-the-art page migration policies and reduces network traffic by up to 2x.
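
To make the idea of a learning-based, network-aware promotion decision concrete, the following is a minimal sketch of a contextual multi-armed bandit in Python. It is not INDIGO's implementation, which lives in the Linux kernel. The epsilon-greedy strategy, the two contexts (`low_contention`, `high_contention`), the two arms (`promote`, `skip`), and the `toy_reward` function with its cost values are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Tabular contextual bandit: one running-mean reward per (context, arm)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # number of pulls per (context, arm)
        self.values = defaultdict(float)  # mean observed reward per (context, arm)

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[(context, a)])

    def update(self, context, arm, reward):
        # Incremental running-mean update of the reward estimate.
        key = (context, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


def toy_reward(context, arm, rng):
    """Hypothetical reward: locality benefit minus a contention-dependent transfer cost."""
    transfer_cost = {"low_contention": 0.2, "high_contention": 1.5}[context]
    if arm == "promote":
        return 1.0 - transfer_cost + rng.gauss(0.0, 0.05)
    return rng.gauss(0.0, 0.05)  # skipping a promotion has no benefit and no cost


def train(steps=2000, seed=0):
    rng = random.Random(seed)
    bandit = EpsilonGreedyContextualBandit(["promote", "skip"], epsilon=0.1, seed=seed)
    for _ in range(steps):
        context = rng.choice(["low_contention", "high_contention"])
        arm = bandit.select(context)
        bandit.update(context, arm, toy_reward(context, arm, rng))
    return bandit


def greedy_choice(bandit, context):
    """Return the arm with the highest learned reward estimate for a context."""
    return max(bandit.arms, key=lambda a: bandit.values[(context, a)])
```

Under these assumed costs, the learned policy promotes pages when the network is lightly loaded and holds back when contention makes transfers expensive. A real system would draw its context from measured network conditions and local memory pressure, and its reward from observed application performance.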

Paper Structure

This paper contains 25 sections, 2 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Unlike in tiered memory systems, multiple nodes in HMD systems contend for the same remote memory pool, leading to network contention.
  • Figure 2: Previously proposed page migration mechanisms do not consider the variable page transfer costs in HMD systems. (Left) Page transfer cost for 4 KB pages in tiered memory and HMD systems under varying network contention. (Right) INDIGO considers the impact of variable page transfer costs with Page Promoter and Page Telemetry, resulting in decreased application runtime. Each application was run with local memory allocation = 10% of the working set size and compared with the best runtime among TPP [maruf2022tpp], MEMTIS [memtis_sosp], and Nimble [yan2019nimble].
  • Figure 3: Runtime degradation for different page migration mechanisms.
  • Figure 4: Page promotion rates with different page migration mechanisms.
  • Figure 5: Dynamically shifting application access patterns for a BFS application.
  • ...and 9 more figures