numaPTE: Managing Page-Tables and TLBs on NUMA Systems

Bin Gao, Qingxuan Kang, Hao-Wei Tee, Kyle Timothy Ng Chu, Alireza Sanaee, Djordje Jevdjic

TL;DR

numaPTE tackles the severe overheads of memory-management operations on NUMA systems by enabling on-demand, partial page-table replication and a decentralized coherence protocol. By owning allocations and replicating only demanded PTEs on the relevant NUMA nodes, it localizes address translations and dramatically reduces TLB shootdowns without the memory and coherence costs of full replication. The Linux implementation on x86_64 shows substantial improvements across benchmarks and real-world workloads, with 12% Webserver and 36% Memcached runtime gains and up to 40x reductions in TLB-related penalties under contention. Overall, numaPTE provides scalable, locality-aware memory-management support that preserves performance as NUMA scales to many sockets.

Abstract

Memory management operations that modify page-tables, typically performed during memory allocation/deallocation, are infamous for their poor performance in highly threaded applications, largely because of the process-wide TLB shootdowns that the OS must issue in the absence of hardware support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead further increases if page-table replication is used, where complete coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it creates additional penalties upon any page-table change because all replicas must be kept coherent. In this paper, we propose a novel page-table management mechanism, called numaPTE, to enable transparent, on-demand, and partial page-table replication across NUMA nodes in order to perform address translation locally, while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that numaPTE's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued upon any memory-management operation. As a result, numaPTE not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access control operations. We implement numaPTE in Linux on x86_64, evaluate it on 4- and 8-socket systems, and show that numaPTE achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving a 12% and 36% runtime improvement on Webserver and Memcached respectively due to a significant reduction in TLB shootdowns.
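To make the sharer-based shootdown idea in the abstract concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the `PageTablePage` class, its `sharers` field, and the helper names are hypothetical and are not taken from the numaPTE implementation. It assumes that a core can only hold a cached translation if its NUMA node has replicated the corresponding page-table page, so a page-table change needs to interrupt only cores on sharer nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PageTablePage:
    """Toy stand-in for one page-table page (hypothetical, not kernel code)."""
    # NUMA nodes that currently hold a replica of this page-table page.
    sharers: Set[int] = field(default_factory=set)


def replicate_on_demand(pt: PageTablePage, node: int) -> None:
    """Record that `node` created a local replica when it first needed it."""
    pt.sharers.add(node)


def shootdown_targets(pt: PageTablePage,
                      cores_by_node: Dict[int, List[int]],
                      use_sharers: bool) -> List[int]:
    """Return the cores that would receive a TLB shootdown.

    use_sharers=False models a process-wide shootdown (every core running
    the process); use_sharers=True models filtering by the sharer set.
    """
    if use_sharers:
        nodes = [n for n in sorted(cores_by_node) if n in pt.sharers]
    else:
        nodes = sorted(cores_by_node)
    return [core for n in nodes for core in cores_by_node[n]]


if __name__ == "__main__":
    # Assumed layout: 4 NUMA nodes with 2 cores each running the process.
    cores = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
    pt = PageTablePage()
    replicate_on_demand(pt, 0)
    replicate_on_demand(pt, 2)
    print(len(shootdown_targets(pt, cores, use_sharers=False)), "cores, process-wide")
    print(len(shootdown_targets(pt, cores, use_sharers=True)), "cores, sharer-filtered")
```

In this toy layout, only half of the cores are interrupted when two of the four nodes have touched the range. The real savings depend on how threads and their accesses are spread across sockets.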

Paper Structure (29 sections, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Impact of page-table replication and the TLB shootdown optimization on mprotect. numaPTE reduces the run-time slowdown by up to 40x, and mitigates the TLB shootdown overhead by leveraging the information about page tables on each socket. All values in both plots are normalized to the baseline Linux v4.17 without replication.
  • Figure 2: a) The slowdown of Linux on mprotect with local threads spinning on the local vs. remote sockets, b) The slowdown of Mitosis and numaPTE over Linux when the range of mprotect is 512KB; note that Mitosis sees a slowdown, while numaPTE experiences a speedup.
  • Figure 3: Impact of data and page-table placement on performance of various applications; L - local, R - remote, P - page-tables, D - data, I - interference of other applications on inter-socket traffic. The impact of page-table walks on the run time is significant, often higher than that of data accesses. Detailed settings are shown in Table \ref{tab:workload-config}.
  • Figure 4: An abstract illustration of the replication of hierarchical page-tables on Mitosis (a) and numaPTE (b). Different memory allocations (VMAs) are colored differently. Mitosis eagerly replicates allocated page-tables on all NUMA nodes (sockets), whereas numaPTE instead replicates them lazily and partially, on demand.
  • Figure 5: (a) Prefetching with a degree of 1 with PTE $X+1$ prefetched. (b) Maximum degree prefetching when the target page-table covers multiple VMAs is limited by both the page-table boundaries and boundaries of the encompassing VMA.
  • ...and 9 more figures
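
The bound described in the Figure 5 caption can be sketched as an interval intersection. The snippet below is an illustrative sketch only: the function names are hypothetical, forward-only prefetching of the PTEs after the faulting one is an assumption made to match the "$X+1$" example, and the constants assume 4 KiB pages with 512-entry leaf page-tables on x86_64.

```python
PAGE_SIZE = 4096                          # 4 KiB base pages on x86_64
PTES_PER_TABLE = 512                      # entries in one leaf page-table page
TABLE_SPAN = PAGE_SIZE * PTES_PER_TABLE   # 2 MiB covered by one leaf page-table


def prefetch_window(fault_addr, vma_start, vma_end):
    """Return [lo, hi): addresses whose PTEs are eligible for prefetching.

    The window is the intersection of the leaf page-table page that covers
    fault_addr and the VMA [vma_start, vma_end) that contains it.
    """
    if not (vma_start <= fault_addr < vma_end):
        raise ValueError("fault address lies outside the VMA")
    table_start = fault_addr - fault_addr % TABLE_SPAN
    table_end = table_start + TABLE_SPAN
    return max(table_start, vma_start), min(table_end, vma_end)


def prefetched_pages(fault_addr, vma_start, vma_end, degree):
    """Virtual page numbers prefetched after the faulting page X.

    Assumption: degree d means PTEs X+1 .. X+d, clipped to the window.
    """
    _, hi = prefetch_window(fault_addr, vma_start, vma_end)
    first = fault_addr // PAGE_SIZE + 1
    last = min(first + degree, hi // PAGE_SIZE)   # exclusive upper bound
    return list(range(first, last))


if __name__ == "__main__":
    # A small VMA that straddles a 2 MiB page-table boundary.
    vma = (0x1FF000, 0x203000)
    print(prefetched_pages(0x200000, *vma, degree=1))
    print(prefetched_pages(0x200000, *vma, degree=8))
    print(prefetched_pages(0x1FF000, *vma, degree=8))
```

With a large requested degree, the second call is cut short by the end of the VMA, and the third returns nothing because the faulting page is the last entry of its page-table.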