Performance Characterization of AutoNUMA Memory Tiering on Graph Analytics

Diego Moura, Vinicius Petrucci, Daniel Mosse

TL;DR

Graph analytics on DRAM+NVM systems expose AutoNUMA's limitations: access patterns are irregular, and many pages are touched only once. The authors profile memory behavior using perf-mem, track allocations with mmap interception, and map samples to memory objects to enable object-level analysis. They show that object-level memory tiering can outperform AutoNUMA, reducing execution time by about 21% on average and up to 51% by cutting NVM accesses (e.g., bc_kron). This work demonstrates the potential of application-aware, object-level memory management to make more effective use of heterogeneous memories in graph workloads.
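The step of mapping profiler samples to memory objects can be illustrated with a small sketch. It is not the authors' tooling: it assumes the sampled data addresses have already been extracted from the profiler output, and that the allocation tracker has produced a list of non-overlapping (name, start address, size) records. The function names and the example object names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def build_index(allocations):
    """Sort allocation records by start address for binary search.

    allocations: iterable of (name, start_address, size_in_bytes).
    Assumption: the address ranges do not overlap.
    """
    ranges = sorted((start, start + size, name) for name, start, size in allocations)
    starts = [r[0] for r in ranges]
    return starts, ranges


def object_for(address, index):
    """Return the name of the object containing address, or None."""
    starts, ranges = index
    i = bisect_right(starts, address) - 1  # last range starting at or before address
    if i >= 0:
        start, end, name = ranges[i]
        if address < end:  # end is exclusive
            return name
    return None


def samples_per_object(sample_addresses, allocations):
    """Count how many sampled addresses fall inside each tracked object."""
    index = build_index(allocations)
    counts = Counter()
    for addr in sample_addresses:
        counts[object_for(addr, index)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical allocations and samples, for illustration only.
    allocs = [("graph_edges", 0x1000, 0x1000), ("scores", 0x4000, 0x800)]
    samples = [0x1000, 0x1FFF, 0x2000, 0x4100, 0x10]
    print(samples_per_object(samples, allocs))
```

Samples that land outside every tracked range are counted under None, which keeps unattributed accesses visible instead of silently dropping them.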

Abstract

Non-Volatile Memory (NVM) can deliver higher density and lower cost per bit than DRAM. Its main drawback is that it is slower than DRAM. On the other hand, DRAM has scalability problems due to its cost and energy consumption. NVM will likely coexist with DRAM in computer systems, and the biggest challenge is knowing which data to allocate on each type of memory. A state-of-the-art approach is AutoNUMA, in the Linux kernel. Prior work is limited to measuring AutoNUMA solely in terms of application execution time, without understanding AutoNUMA's behavior. In this work we provide a more in-depth characterization of AutoNUMA, for instance identifying exactly where a set of pages is allocated while keeping track of the promotion and demotion decisions performed by AutoNUMA. Our analysis shows that AutoNUMA's benefits can be modest when running graph processing applications, or graph analytics, because most pages are accessed only once over the entire execution and accesses to the remaining pages have no temporal locality. We make a case for exploring application characteristics using object-level mappings between DRAM and NVM. Our preliminary experiments show that object-level memory tiering can better capture application behavior and reduce the execution time of graph analytics by 21% (avg) and 51% (max) compared to AutoNUMA, while significantly reducing the number of memory accesses to NVM.
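The observation that most pages are accessed only once can be made concrete with a short sketch that buckets pages by how many sampled accesses they received (1, 2, or 3+). This is an illustration under stated assumptions rather than the paper's analysis code: the 4 KiB page size, the function name, and the input format (a flat list of sampled data addresses) are all assumptions.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def touch_histogram(sample_addresses, page_size=PAGE_SIZE):
    """Return the fraction of touched pages with 1, 2, or 3+ sampled accesses."""
    # Count sampled accesses per page number.
    touches = Counter(addr // page_size for addr in sample_addresses)

    # Bucket pages by touch count.
    buckets = Counter()
    for n in touches.values():
        buckets["3+" if n >= 3 else str(n)] += 1

    total = len(touches)
    if total == 0:
        return {"1": 0.0, "2": 0.0, "3+": 0.0}
    return {key: buckets[key] / total for key in ("1", "2", "3+")}


if __name__ == "__main__":
    # Hypothetical sampled addresses, for illustration only.
    addrs = [0, 8, 4096, 8192, 8200, 8300, 12288]
    print(touch_histogram(addrs))
```

A page that is sampled once gives a migration policy no reuse to exploit, so a large "1" bucket suggests that promoting pages after they are touched will rarely pay off.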

Paper Structure

This paper contains 37 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Usage of Non-Volatile Memory in a system: Memory Mode versus App Direct Mode.
  • Figure 2: Profiling workflow for memory object characterization.
  • Figure 3: Percentage of memory samples mapped to DRAM and NVM for different graph applications/datasets.
  • Figure 4: Percentage of page accesses with 1, 2, or 3+ touches for different applications/datasets (each access occurred external to caches).
  • Figure 5: Statistics of page reuse in time. We consider pages touched exactly twice and associated with the most accessed memory object allocated on NVM.
  • ...and 6 more figures