TDRAM: Tag-enhanced DRAM for Efficient Caching

Maryam Babaie, Ayaz Akram, Wendy Elsasser, Brent Haukness, Michael Miller, Taeksang Song, Thomas Vogelsang, Steven Woo, Jason Lowe-Power

TL;DR

TDRAM addresses the scaling limits of SRAM caches with a DRAM cache microarchitecture that stores tags on the same die as the data, enabling on-die tag checks and conditional data transfers. It adds an HM bus, ActRd/ActWr commands, and a flush buffer to decouple tag processing from data movement, and it supports early tag probing to reduce miss penalties. The reported gains are at least 2.6x faster tag checks, a 1.2x system speedup, and 21% lower energy consumption versus state-of-the-art designs, with robust performance on HPC workloads and favorable behavior in disaggregated memory scenarios. Overall, TDRAM delivers scalable, energy-efficient DRAM caching that narrows the gap between last-level caches and remote memory in heterogeneous memory architectures.

Abstract

As SRAM-based caches hit a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demand requests. In this paper, we propose TDRAM, a novel DRAM microarchitecture tailored for caching. TDRAM enhances HBM3 by adding a set of small low-latency mats to store tags and metadata on the same die as the data mats. These mats enable fast parallel tag and data access, on-DRAM-die tag comparison, and a conditional data response based on the comparison result (reducing wasted data transfers), akin to the mechanism used in SRAM caches. TDRAM further optimizes hit and miss latencies by performing opportunistic early tag probing. Moreover, TDRAM introduces a flush buffer to store conflicting dirty data on write misses, eliminating turnaround delays on the data bus. We evaluate TDRAM using a full-system simulator and a set of HPC workloads with large memory footprints, showing that TDRAM provides at least 2.6$\times$ faster tag checks, a 1.2$\times$ speedup, and 21% lower energy consumption compared to state-of-the-art commercial and research designs.
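
The conditional data response and the flush buffer described in the abstract can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the paper's design: the direct-mapped organization, the 64-byte line size, and every name (`ToyTagEnhancedCache`, `Response`, `flush_buffer`) are hypothetical, fills after a read miss are not modeled, and timing is ignored entirely. Returning the victim's data on a read miss to a dirty line is one reading of the Figure 5 caption, which says that case has the same timing as a read.

```python
from dataclasses import dataclass, field
from typing import Optional

LINE_BYTES = 64  # assumed cache-line size, not taken from the paper


@dataclass
class Line:
    tag: int
    data: bytes
    dirty: bool = False


@dataclass
class Response:
    hit: bool
    data: Optional[bytes]  # None means nothing crossed the data bus


@dataclass
class ToyTagEnhancedCache:
    """Hypothetical direct-mapped model of a cache that checks tags itself."""
    num_sets: int
    lines: dict = field(default_factory=dict)         # set index -> Line
    flush_buffer: list = field(default_factory=list)  # (address, data) of dirty victims

    def _split(self, addr):
        block = addr // LINE_BYTES
        return block % self.num_sets, block // self.num_sets  # (index, tag)

    def read(self, addr):
        index, tag = self._split(addr)
        line = self.lines.get(index)
        if line is not None and line.tag == tag:
            return Response(hit=True, data=line.data)
        if line is not None and line.dirty:
            # Miss to a dirty line: send the victim so it can be written back.
            return Response(hit=False, data=line.data)
        # Miss to a clean or empty line: only the hit/miss result is returned.
        return Response(hit=False, data=None)

    def write(self, addr, data):
        index, tag = self._split(addr)
        line = self.lines.get(index)
        hit = line is not None and line.tag == tag
        if line is not None and not hit and line.dirty:
            # Park the conflicting dirty line instead of reading it out
            # over the data bus in the middle of the write.
            victim_addr = (line.tag * self.num_sets + index) * LINE_BYTES
            self.flush_buffer.append((victim_addr, line.data))
        self.lines[index] = Line(tag, data, dirty=True)
        return Response(hit=hit, data=None)
```

In this toy model, a write to address 256 in a four-set cache evicts a dirty line at address 0 into `flush_buffer` without returning any data, while a read miss to an empty set returns only the miss indication.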

Paper Structure (36 sections, 15 figures, 4 tables)

Figures (15)

  • Figure 1: The breakdown of hit and miss ratios of DRAM cache. The letters show high or low miss ratio.
  • Figure 2: The average queueing delay of read demands in the controller's read buffer, for Intel's Cascade Lake and Alloy DRAM caches, compared to a system with main memory only (no DRAM cache). This delay is the time requests spend waiting in the buffer before accessing the memory.
  • Figure 3: Bandwidth of Intel's commercial Cascade Lake DRAM cache and the Alloy DRAM cache, broken down into useful and unuseful data movement and normalized to total system bandwidth. On all read/write misses to a clean line and on write hits, the tag comparison also retrieves data that the controller immediately discards; this traffic serves no purpose and is shown as unuseful. Alloy has a longer burst length than Cascade Lake, which increases its unuseful data movement.
  • Figure 4: TDRAM's architecture and bank organization.
  • Figure 5: Timing of the transactions in a read operation in TDRAM. The timing is the same for a read miss to a dirty line.
  • ...and 10 more figures
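
The useful/unuseful split described in the Figure 3 caption can be written as a small classifier. This is a sketch under assumptions: the function names, the `(op, hit, dirty)` trace format, and the default 64-byte burst are hypothetical, and treating every case the caption does not list as useful is an inference from the caption rather than a statement from the paper. It models a baseline in which every tag check also moves a data burst.

```python
def is_useful_transfer(op, hit, dirty):
    """Classify the data burst that accompanies a tag check in the baseline.

    Per the Figure 3 caption, the burst is discarded (unuseful) on write
    hits and on read/write misses to a clean line. All other cases are
    assumed here to be useful.
    """
    if op == "write" and hit:
        return False  # write hit: the old data is simply overwritten
    if not hit and not dirty:
        return False  # miss to a clean line: nothing to write back
    return True       # read hit, or miss to a dirty line


def bandwidth_breakdown(accesses, burst_bytes=64):
    """Sum useful and unuseful bytes over (op, hit, dirty) tuples.

    burst_bytes is a hypothetical parameter; a design with a longer burst
    moves more unuseful bytes for the same access mix.
    """
    totals = {"useful": 0, "unuseful": 0}
    for op, hit, dirty in accesses:
        key = "useful" if is_useful_transfer(op, hit, dirty) else "unuseful"
        totals[key] += burst_bytes
    return totals
```

For a four-access trace with one read hit, one read miss to a clean line, one write hit, and one write miss to a dirty line, half of the bytes moved are classified as unuseful.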