Phoenix -- A Novel Technique for Performance-Aware Orchestration of Thread and Page Table Placement in NUMA Systems

Mohammad Siavashi, Alireza Sanaee, Mohsen Sharifi, Gianni Antichi

TL;DR

In NUMA systems, page walks that reach remote memory are slow, and replicating page tables to avoid them adds coherency overhead. Phoenix is an integrated OS approach that distinguishes page-table pages from data pages, enabling on-demand replication and direct page-table migration coordinated with thread scheduling, and it uses memory-bandwidth management to preserve QoS. Its combined design (thread consolidation, home-node placement, and selective replication) reduces CPU cycles by 2.09x and page-walk cycles by 1.58x on real hardware compared to state-of-the-art solutions, and is more resilient to inter-socket interference. Phoenix is implemented as a loadable Linux kernel module, which keeps it compatible with legacy applications and easy to deploy, and it highlights the value of co-designing schedulers and memory managers for NUMA-aware systems.

Abstract

The emergence of symmetric multi-processing (SMP) systems with non-uniform memory access (NUMA) has prompted extensive research on process and data placement to mitigate the performance impact of NUMA on applications. However, existing solutions often overlook the coordination between the CPU scheduler and memory manager, leading to inefficient thread and page table placement. Moreover, replication techniques employed to improve locality suffer from redundant replicas, scalability barriers, and performance degradation due to memory bandwidth and inter-socket interference. In this paper, we present Phoenix, a novel integrated CPU scheduler and memory manager with an on-demand page table replication mechanism. Phoenix integrates the CPU scheduler and memory management subsystems, allowing for coordinated thread and page table placement. By differentiating between data and page table pages, Phoenix enables direct migration or replication of page tables based on application behavior. Additionally, Phoenix employs a memory bandwidth management mechanism to maintain Quality of Service (QoS) while mitigating coherency maintenance overhead. We implemented Phoenix as a loadable kernel module for Linux, ensuring compatibility with legacy applications and ease of deployment. Our evaluation on real hardware demonstrates that Phoenix reduces CPU cycles by 2.09x and page-walk cycles by 1.58x compared to state-of-the-art solutions.
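
The choice between migrating and replicating page tables can be illustrated with a small sketch. This is not Phoenix's actual policy, which the abstract does not spell out. It assumes a simplified rule: if all threads of a process run on one node, a single page-table copy is moved there; if threads span several nodes, copies are kept only on those nodes. The names `Placement` and `place_page_table` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    action: str       # "keep", "migrate", or "replicate" (hypothetical labels)
    nodes: frozenset  # NUMA nodes that should hold a page-table copy


def place_page_table(thread_nodes, table_nodes):
    """Pick a page-table placement from the nodes where threads run.

    thread_nodes: iterable of NUMA node ids, one entry per thread.
    table_nodes:  set of node ids that currently hold a page-table copy.
    Assumed rule (illustrative only): copies should exist exactly on the
    nodes that run threads of the process.
    """
    wanted = frozenset(thread_nodes)
    current = frozenset(table_nodes)
    if not wanted:
        raise ValueError("process has no threads to place")
    if wanted == current:
        # Page tables are already local to every thread.
        return Placement("keep", current)
    if len(wanted) == 1:
        # Threads are consolidated on one node: one copy there is enough.
        return Placement("migrate", wanted)
    # Threads span several nodes: replicate only where threads run.
    return Placement("replicate", wanted)


if __name__ == "__main__":
    print(place_page_table([0, 0, 0], {1}))
    print(place_page_table([0, 1], {0}))
```

Under this assumed rule, replicas are created only on nodes that actually run threads, which is one way to avoid the redundant replicas the abstract attributes to earlier replication techniques.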

Paper Structure

This paper contains 28 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Comparison of thread and page table placement in Linux, Mitosis, and Phoenix running two applications. Phoenix can consolidate the threads of an application on more than one node, prioritizing the nodes with the fastest interconnect for faster communication. For clarity, the figure depicts only the basic scenario.
  • Figure 2: Performance comparison between Linux and Mitosis with 2-way replication running BTree under heavy memory bandwidth interference.
  • Figure 3: The Apache web server experiences higher latency with 2-way replication. The 99th-percentile tail latency increased by $19.9\%$.
  • Figure 4: Overhead of replication on memory management system calls as the number of replicas grows. Mitosis uses a page cache, which causes the mmap line to drop when replication is turned on.
  • Figure 5: Example of an SMP NUMA system
  • ...and 6 more figures