Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory

Jeongmin Hong, Sungjun Cho, Geonwoo Park, Wonhyuk Yang, Young-Ho Gong, Gwangsun Kim

TL;DR

The paper tackles the memory capacity wall in GPUs by combining Storage-Class Memory (SCM) with a HW-managed DRAM cache in a Heterogeneous Memory Stack (HMS). It introduces three core innovations: Aggregated Metadata-In-Last-Column (AMIL) to minimize tag probes while preserving ECC, a Configurable Tag Cache (CTC) to share L2 cache capacity with DRAM-cache tags, and an SCM-aware DRAM cache bypass policy that uses an SCM Penalty Score and a DRAM-Affinity Score to decide when data should bypass the DRAM cache. Together with power management and the ability to operate SCM in different modes (SLC/MLC), the design achieves substantial performance and energy gains, outperforming oversubscribed HBM and previous DRAM-cache approaches. The evaluation across diverse workloads shows up to $12.5\times$ performance improvement over oversubscribed HBM (and $2.9\times$ average) with significant reductions in tag probes and SCM writes, along with manageable hardware overhead and safe thermal behavior. Overall, the work demonstrates a practical, scalable path to dramatically increase GPU memory capacity and bandwidth using HMS with SCM-backed memory stacks.

Abstract

We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache needs to be carefully designed to address the latency and BW limitations of the SCM while minimizing cost overhead and considering the GPU's characteristics. Because the massive number of GPU threads can thrash the DRAM cache, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multi-dimensional characteristics of memory accesses by GPUs with SCM to bypass DRAM for data with low performance utility. In addition, to reduce DRAM cache probes and increase effective DRAM BW with minimal cost, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to cache DRAM cacheline tags. The L2 capacity used for the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags in a single column within a row. AMIL also retains full ECC protection, unlike the Tag-And-Data (TAD) organization of prior DRAM caches. Additionally, we propose SCM throttling to curtail power and exploit SCM's SLC/MLC modes to adapt to the workload's memory footprint. While our techniques can be used for different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, HMS improves performance by up to 12.5x (2.9x overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91-93% and 57-75%, respectively.
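
To make the interplay between the tag cache and the aggregated tag column more concrete, the following is a minimal toy model, not the paper's implementation. It assumes a direct-mapped DRAM cache, a tag cache that tracks whole rows in LRU order, and small illustrative sizes; the class name `ToyDramCache`, the constants `LINES_PER_ROW` and `NUM_ROWS`, and the `ctc_rows` parameter are all hypothetical. The bypass policy, ECC, and timing are not modeled. The sketch only counts how many DRAM reads are spent purely on fetching tags.

```python
from collections import OrderedDict

LINES_PER_ROW = 4   # assumed: data cachelines per DRAM row (illustrative only)
NUM_ROWS = 8        # assumed: number of rows in the toy DRAM cache


class ToyDramCache:
    """Direct-mapped toy DRAM cache with one aggregated tag column per row."""

    def __init__(self, ctc_rows):
        # tags[row][slot] is the tag of the line cached in that slot, or None.
        self.tags = [[None] * LINES_PER_ROW for _ in range(NUM_ROWS)]
        # Tag cache: set of row ids whose tags are held on chip, in LRU order.
        # Tag updates are treated as write-through for simplicity (assumption).
        self.ctc = OrderedDict()
        self.ctc_rows = ctc_rows   # how many rows' worth of tags fit in the tag cache
        self.tag_probes = 0        # DRAM column reads issued only to fetch tags

    def _row_tags(self, row):
        """Return a row's tag list, probing DRAM only on a tag-cache miss."""
        if row in self.ctc:
            self.ctc.move_to_end(row)           # refresh LRU position
        else:
            self.tag_probes += 1                # one column read covers the whole row
            self.ctc[row] = True
            if len(self.ctc) > self.ctc_rows:
                self.ctc.popitem(last=False)    # evict the least recently used row
        return self.tags[row]

    def access(self, line_addr):
        """Look up a cacheline address and fill on a miss. Returns True on a hit."""
        slot_index = line_addr % (NUM_ROWS * LINES_PER_ROW)
        row, slot = divmod(slot_index, LINES_PER_ROW)
        tag = line_addr // (NUM_ROWS * LINES_PER_ROW)
        tags = self._row_tags(row)
        hit = tags[slot] == tag
        if not hit:
            tags[slot] = tag    # fill from SCM; this sketch never bypasses
        return hit
```

In this model, with `ctc_rows=2`, accessing lines 0, 1, 2, 3 and then 0 and 1 again issues a single tag probe, because all six lookups fall in the same row and its tags arrive in one column read.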

Paper Structure

This paper contains 29 sections, 1 equation, 22 figures, and 2 tables.

Figures (22)

  • Figure 1: (a) Improvement in compute throughput (with Tensor Core and Matrix Core when applicable) and memory capacity of GPUs over time. (b) Cost-effectiveness of different GPU architectures under memory oversubscription.
  • Figure 2: (a) Graph500 benchmark's data size example. (b) Row buffer locality (defined as the average number of column accesses per row activation) of representative workloads.
  • Figure 3: (Left) Validation of UM simulation at a fixed oversubscription ratio (log scale plot). (Right) Workloads' memory footprints used for validation.
  • Figure 4: Memory channel BW utilization from memory devices with HBM organization for synthetic access patterns (configuration details in the methodology section).
  • Figure 5: Design space of a GPU with SCM and DRAM cache with (a) 3D-stacked DRAM and SCM and (b) separate DRAM and SCM stacks.
  • ...and 17 more figures
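
The row buffer locality metric defined in the Figure 2 caption (average number of column accesses per row activation) can be computed from an access trace as sketched below. This is an illustrative helper, not taken from the paper: the function name `row_buffer_locality`, the `(bank, row)` trace format, and the open-page policy (a row stays open in its bank until a different row is accessed there) are assumptions.

```python
def row_buffer_locality(trace):
    """Average column accesses per row activation.

    trace: iterable of (bank, row) pairs, one pair per column access.
    Assumes an open-page policy with one open row per bank.
    """
    open_row = {}       # bank -> row currently held in that bank's row buffer
    activations = 0
    accesses = 0
    for bank, row in trace:
        accesses += 1
        if open_row.get(bank) != row:
            activations += 1        # a different row must be activated first
            open_row[bank] = row
    return accesses / activations if activations else 0.0
```

For example, the trace `[(0, 5), (0, 5), (0, 5), (0, 7), (1, 2), (1, 2)]` contains six column accesses and three row activations, so the function returns 2.0.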