INDIGO: Page Migration for Hardware Memory Disaggregation Across a Network
Archit Patke, Christian Pinto, Saurabh Jha, Haoran Qiu, Zbigniew Kalbarczyk, Ravishankar Iyer
TL;DR
INDIGO tackles the remote-memory performance penalty in hardware memory disaggregation (HMD) with a network-aware page migration framework that accounts for the variable page transfer costs caused by network contention. The approach combines Page Telemetry (burst-duration-aware access-rate estimation) with a Page Promoter, controlled by a contextual multi-armed bandit, that decides when to migrate pages based on locality benefits, local memory constraints, and network conditions. Evaluations on a real HMD prototype with cloud and HPC workloads show up to 50-70% improvement in application performance and up to 2x reduction in network traffic, with reasonable training cost and robustness to unseen workloads. These results indicate that telemetry-driven, learning-based page migration can significantly reduce remote-memory penalties and lower the total cost of ownership of memory-disaggregated data centers.
Abstract
Hardware memory disaggregation (HMD) is an emerging technology that enables access to remote memory, thereby creating expansive memory pools and reducing memory underutilization in datacenters. However, a significant challenge arises when accessing remote memory over a network: increased contention that can lead to severe application performance degradation. To reduce the performance penalty of using remote memory, the operating system uses page migration to promote frequently accessed pages closer to the processor. Previously proposed page migration mechanisms do not achieve the best performance in HMD systems because they are oblivious to the variable page transfer costs caused by network contention. To address this limitation, we present INDIGO: a network-aware page migration framework that uses novel page telemetry and a learning-based approach for network adaptation. We implemented INDIGO in the Linux kernel and evaluated it with common cloud and HPC applications on a real disaggregated memory system prototype. Our evaluation shows that INDIGO offers up to 50-70% improvement in application performance compared to other state-of-the-art page migration policies and reduces network traffic by up to 2x.
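To make the idea of a learning-based, network-aware promotion decision concrete, the following is a minimal Python sketch of a contextual bandit that chooses between deferring and migrating a page. It is an illustration under stated assumptions, not INDIGO's implementation: the epsilon-greedy strategy, the three context features and their thresholds (`make_context`), and the reward model (`toy_reward`) are all hypothetical stand-ins, since this section does not specify the actual features, reward, or bandit algorithm.

```python
import itertools
import random
from collections import defaultdict

ARMS = ("defer", "migrate")  # hypothetical action set


class EpsilonGreedyContextualBandit:
    """Tabular epsilon-greedy bandit keyed by a discrete context (an assumption)."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> number of pulls
        self.values = defaultdict(float)  # (context, arm) -> running mean reward

    def greedy(self, context):
        # Best-known arm for this context, with no exploration.
        return max(ARMS, key=lambda arm: self.values[(context, arm)])

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ARMS)
        return self.greedy(context)

    def update(self, context, arm, reward):
        # Incremental running-mean update of the arm's value estimate.
        key = (context, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


def make_context(access_rate, net_utilization, local_free_fraction):
    """Discretize hypothetical telemetry into (hot, congested, tight)."""
    return (access_rate >= 100.0, net_utilization >= 0.7, local_free_fraction < 0.1)


def toy_reward(context, arm):
    """Made-up reward: locality benefit minus transfer and memory-pressure cost."""
    hot, congested, tight = context
    if arm == "defer":
        return 0.0
    benefit = 1.0 if hot else 0.1
    cost = (0.8 if congested else 0.2) + (0.5 if tight else 0.0)
    return benefit - cost


def train(bandit, rounds_per_context=200):
    # Visit every discrete context repeatedly so each arm gets explored.
    for context in itertools.product((False, True), repeat=3):
        for _ in range(rounds_per_context):
            arm = bandit.select(context)
            bandit.update(context, arm, toy_reward(context, arm))


if __name__ == "__main__":
    bandit = EpsilonGreedyContextualBandit(epsilon=0.2, seed=0)
    train(bandit)
    # Hot page, idle network, ample local memory.
    print(bandit.greedy(make_context(500.0, 0.2, 0.5)))
    # Hot page, congested network, scarce local memory.
    print(bandit.greedy(make_context(500.0, 0.9, 0.05)))
```

In this toy setting the learned policy depends on network state and local memory pressure as well as page hotness, which is the property that distinguishes a network-aware policy from one driven by access frequency alone.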
