Characterizing the impact of last-level cache replacement policies on big-data workloads

Alexandre Valentin Jamet, Lluc Alvarez, Marc Casas

TL;DR

Graph-processing workloads face severe memory bottlenecks: their irregular access patterns can cause up to $80\%$ of runtime to be spent waiting for memory. The authors compare six LLC replacement policies (SRRIP, DRRIP, SHiP, Hawkeye, Glider, and MPPPB) using ChampSim on a Cascade Lake-style model to determine their effectiveness on these workloads. They find high Misses-Per-Kilo-Instruction (MPKI) values across the cache hierarchy (L1D: $53.2$, L2C: $44.2$, LLC: $41.8$) and that $78.6\%$ of L1D misses reach DRAM. The sophisticated policies fail to generalize beyond SPEC benchmarks because graph workloads exhibit limited PC-to-address correlations. The study concludes that, despite their hardware complexity, modern cache replacement policies yield limited gains over an LRU baseline for graph-processing workloads, highlighting the need for alternative memory-system strategies tailored to irregular memory access patterns.
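To make the contrast between the LRU baseline and the simplest policy in the comparison concrete, here is a minimal Python sketch of a single fully associative cache set managed by LRU and by SRRIP. It is not ChampSim's implementation. It assumes 2-bit re-reference prediction values (RRPVs), insertion at RRPV = max - 1, and the hit-priority update that resets a hit line to 0. The names `LRUSet`, `SRRIPSet`, and `count_misses` are hypothetical, and the PC-based predictors used by SHiP, Hawkeye, Glider, and MPPPB are not modeled.

```python
from collections import OrderedDict


class LRUSet:
    """One fully associative set with least-recently-used replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, ordered from LRU to MRU

    def access(self, tag):
        """Return True on a hit; on a miss, fill the line and return False."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # promote the hit line to MRU
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[tag] = None
        return False


class SRRIPSet:
    """One set with SRRIP (hit-priority variant); assumes M-bit RRPVs."""

    def __init__(self, ways, rrpv_bits=2):
        self.ways = ways
        self.max_rrpv = (1 << rrpv_bits) - 1
        self.rrpv = {}  # tag -> re-reference prediction value

    def access(self, tag):
        if tag in self.rrpv:
            self.rrpv[tag] = 0  # hit: predict a near-immediate re-reference
            return True
        if len(self.rrpv) >= self.ways:
            # Age every line until at least one predicts a distant re-reference.
            while self.max_rrpv not in self.rrpv.values():
                for t in self.rrpv:
                    self.rrpv[t] += 1
            victim = next(t for t, v in self.rrpv.items() if v == self.max_rrpv)
            del self.rrpv[victim]
        self.rrpv[tag] = self.max_rrpv - 1  # insert with a "long" re-reference prediction
        return False


def count_misses(cache_set, trace):
    """Replay a trace of line tags and count the misses."""
    return sum(not cache_set.access(tag) for tag in trace)
```

On the 4-way trace `[0, 1, 0, 1, 10, 11, 12, 13, 14, 0, 1]`, the one-time scan of lines 10 to 14 pushes the reused lines 0 and 1 out under LRU (9 misses), while SRRIP ages the scan lines out first and keeps 0 and 1 resident (7 misses). This kind of scan resistance only pays off when reused and streaming lines behave differently, which is exactly the separation that graph traversals make hard to exploit.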

Abstract

In recent years, graph-processing has become an essential class of workloads with applications in a rapidly growing number of fields. Graph-processing typically uses large input sets, often at multi-gigabyte scale, and data-dependent graph traversal methods that exhibit irregular memory access patterns. Recent work demonstrates that, due to the highly irregular memory access patterns of data-dependent graph traversals, state-of-the-art graph-processing workloads spend up to 80% of the total execution time waiting for memory accesses to be served by the DRAM. The vast disparity between Last Level Cache (LLC) and main memory latencies is a problem that has been addressed for years in computer architecture. One of the prevailing approaches to mitigating this performance gap between modern CPUs and DRAM is the design of cache replacement policies. In this work, we characterize the challenges posed by graph-processing workloads and evaluate the most relevant cache replacement policies.
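The abstract traces the irregular access patterns back to data-dependent graph traversals, and Figure 1 shows graphs stored in the CSR/CSC formats. Here is a minimal Python sketch of that layout, under stated assumptions. The hypothetical `build_csr` packs a made-up edge list into offset and neighbor arrays. The hypothetical `property_access_trace` then lists which entries of a per-vertex property array one pass over all edges would touch. This is not the paper's benchmark code.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, neighbors) from a list of (src, dst) edges."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1  # count out-edges per vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]  # prefix sum: start of each vertex's row
    neighbors = [0] * len(edges)
    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors


def property_access_trace(offsets, neighbors):
    """Indices into a per-vertex property array touched by one pass over all edges."""
    trace = []
    for v in range(len(offsets) - 1):
        for i in range(offsets[v], offsets[v + 1]):  # sequential read of the row
            trace.append(neighbors[i])  # data-dependent index: the irregular access
    return trace
```

The offset and neighbor arrays are streamed sequentially and prefetch well. The property-array indices, however, come from the graph data itself, so on multi-gigabyte inputs consecutive accesses can land far apart in memory. Building the CSC form only requires swapping each edge to `(dst, src)`, so that each row lists a vertex's incoming neighbors.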

Paper Structure

This paper contains 7 sections and 3 figures.

Figures (3)

  • Figure 1: Example of a graph representation in memory using the CSR/CSC formats.
  • Figure 2: Misses-Per-Kilo-Instruction (MPKI) across the different levels of the cache hierarchy triggered by graph-processing workloads.
  • Figure 3: Geometric mean speed-up over LRU of state-of-the-art LLC replacement policies for the different benchmark suites.
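Figures 2 and 3 report results as MPKI per cache level and as the geometric-mean speed-up over LRU. The small Python helpers below (`mpki`, `dram_fraction`, `geomean_speedup`) are hypothetical and come from neither the paper nor ChampSim. They rest on three assumptions. Speed-up is taken as baseline cycles divided by policy cycles for the same instruction count. The share of L1D misses reaching DRAM is approximated by the LLC-to-L1D MPKI ratio. Prefetch traffic is ignored.

```python
import statistics


def mpki(misses, instructions):
    """Misses per kilo-instruction (MPKI): misses per 1,000 retired instructions."""
    return misses * 1000.0 / instructions


def dram_fraction(l1d_mpki, llc_mpki):
    """Approximate share of L1D misses that also miss in the LLC and go to DRAM."""
    return llc_mpki / l1d_mpki


def geomean_speedup(baseline_cycles, policy_cycles):
    """Geometric mean of per-benchmark speed-ups over a baseline such as LRU.

    Assumes each benchmark retires the same instructions under both policies,
    so that speed-up = baseline cycles / policy cycles.
    """
    if len(baseline_cycles) != len(policy_cycles):
        raise ValueError("need one policy result per baseline result")
    ratios = [b / p for b, p in zip(baseline_cycles, policy_cycles)]
    return statistics.geometric_mean(ratios)
```

With the TL;DR's aggregate values, `dram_fraction(53.2, 41.8)` is about 0.786, which is consistent with the reported 78.6%. The geometric mean is the usual choice for averaging ratios. For example, a 2x speed-up on one benchmark and a 2x slowdown on another average to 1.0, whereas an arithmetic mean of the ratios would give 1.25.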