Phoenix -- A Novel Technique for Performance-Aware Orchestration of Thread and Page Table Placement in NUMA Systems
Mohammad Siavashi, Alireza Sanaee, Mohsen Sharifi, Gianni Antichi
TL;DR
On NUMA systems, page walks that reach remote memory are costly, and replicating page tables to avoid them adds coherency overhead. Phoenix integrates the CPU scheduler and the memory manager, distinguishes page-table pages from data pages, and migrates or replicates page tables on demand in coordination with thread scheduling, while managing memory bandwidth to preserve QoS. Its combined design (thread consolidation, home-node placement, and selective replication) reduces CPU cycles by 2.09x and page-walk cycles by 1.58x compared to state-of-the-art solutions on real hardware, and it is more resilient to inter-socket interference. Implemented as a Linux loadable kernel module, Phoenix remains compatible with legacy applications and is easy to deploy, and it highlights the value of co-designing schedulers and memory managers for NUMA systems.
Abstract
The emergence of symmetric multi-processing (SMP) systems with non-uniform memory access (NUMA) has prompted extensive research on process and data placement to mitigate the performance impact of NUMA on applications. However, existing solutions often overlook coordination between the CPU scheduler and the memory manager, leading to inefficient thread and page table placement. Moreover, the replication techniques used to improve locality suffer from redundant replicas, scalability barriers, and performance degradation caused by memory bandwidth contention and inter-socket interference. In this paper, we present Phoenix, a novel integrated CPU scheduler and memory manager with an on-demand page table replication mechanism. By integrating the two subsystems, Phoenix coordinates thread and page table placement. By differentiating between data pages and page table pages, Phoenix can directly migrate or replicate page tables based on application behavior. Additionally, Phoenix employs a memory bandwidth management mechanism to maintain Quality of Service (QoS) while mitigating the overhead of keeping replicas coherent. We implemented Phoenix as a loadable kernel module for Linux, ensuring compatibility with legacy applications and ease of deployment. Our evaluation on real hardware demonstrates that Phoenix reduces CPU cycles by 2.09x and page-walk cycles by 1.58x compared to state-of-the-art solutions.
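To make the migrate-versus-replicate distinction concrete, the following is a minimal illustrative sketch, not Phoenix's actual policy. It assumes a hypothetical rule: if all of a process's threads run on one node, the page table is migrated there; if threads span several nodes, replicas are created only on remote nodes hosting a meaningful share of the threads. The names `place_page_table`, `Decision`, `Action`, and the `min_share` threshold are assumptions introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    MIGRATE = "migrate"
    REPLICATE = "replicate"


@dataclass(frozen=True)
class Decision:
    action: Action
    target_nodes: tuple[int, ...]


def place_page_table(thread_nodes: list[int], table_node: int,
                     min_share: float = 0.25) -> Decision:
    """Hypothetical placement rule (an assumption, not the paper's algorithm).

    thread_nodes: NUMA node of each thread of the process.
    table_node:   NUMA node currently holding the page table.
    min_share:    minimum fraction of threads a remote node must host
                  before it receives a replica.
    """
    if not thread_nodes:
        return Decision(Action.NONE, ())

    counts = Counter(thread_nodes)
    total = len(thread_nodes)

    # All threads consolidated on one node: move the table to that node.
    if len(counts) == 1:
        (node,) = counts
        if node == table_node:
            return Decision(Action.NONE, ())
        return Decision(Action.MIGRATE, (node,))

    # Threads span nodes: replicate only where enough threads run remotely.
    remote = tuple(sorted(
        n for n, c in counts.items()
        if n != table_node and c / total >= min_share
    ))
    if remote:
        return Decision(Action.REPLICATE, remote)
    return Decision(Action.NONE, ())
```

For example, `place_page_table([1, 1, 1], table_node=0)` yields a migration to node 1, whereas `place_page_table([0, 0, 1, 1], table_node=0)` yields a replica on node 1 only. The threshold illustrates how selective replication could avoid redundant replicas; the real system's criteria may differ.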
