DL-PIM: Improving Data Locality in Processing-in-Memory Systems

Parker Hao Tian, Zahra Yousefijamarani, Alaa Alameldeen

TL;DR

DL-PIM tackles data movement overhead in processing-in-memory (PIM) systems by dynamically relocating frequently accessed blocks to a reserved area in the requester's local memory and using a distributed subscription table to redirect requests to each block's current location. It introduces adaptive policies (hops-based and latency-based) plus central global coordination to avoid cross-vault imbalances. In HMC, it reduces average memory latency per request by about 54% and speeds up data-reuse-heavy workloads by 15% (6% across all representative workloads); in HBM, it reduces latency by about 50%, with speedups of 5% and 3% respectively. These results illustrate that data locality, managed entirely in hardware, can yield meaningful performance and energy benefits for PIM systems, with adaptive mechanisms mitigating indirection overhead.

Abstract

PIM architectures aim to reduce data transfer costs between processors and memory by integrating processing units within memory layers. Prior PIM architectures have shown potential to improve energy efficiency and performance. However, such advantages rely on data proximity to the processing units performing computations. Data movement overheads can degrade PIM's performance and energy efficiency due to the need to move data between a processing unit and a distant memory location. In this paper, we demonstrate that a large fraction of PIM's latency per memory request is attributed to data transfers and queuing delays from remote memory accesses. To improve PIM's data locality, we propose DL-PIM, a novel architecture that dynamically detects the overhead of data movement and proactively moves data to a reserved area in the local memory of the requesting processing unit. DL-PIM uses a distributed address-indirection hardware lookup table to redirect traffic to the current data location. We propose DL-PIM implementations on two 3D-stacked memories: HMC and HBM. While some workloads benefit from DL-PIM, others are negatively impacted by the additional latency of indirection accesses. Therefore, we propose an adaptive mechanism that assesses the cost and benefit of indirection and dynamically enables or disables it to avoid degrading workloads that suffer from indirection. Overall, DL-PIM reduces the average memory latency per request by 54% in HMC and 50% in HBM, which yields a performance improvement of 15% in HMC and 5% in HBM for workloads with substantial data reuse. Across all representative workloads, DL-PIM achieves a 6% speedup in HMC and a 3% speedup in HBM, showing that DL-PIM enhances data locality and overall system performance.
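
To make the indirection idea concrete, here is a minimal software sketch of an address-indirection table with an on/off switch driven by a cost/benefit comparison. It is only a toy model under stated assumptions, not the hardware design from the paper: the class names (`Vault`, `IndirectionTable`), the contiguous home-vault mapping, the reserved-slot capacity, and the `adapt` inputs are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vault:
    """Toy model of one vault with a reserved area for relocated blocks."""
    vault_id: int
    reserved_capacity: int = 4  # hypothetical number of reserved slots
    reserved: set = field(default_factory=set)


class IndirectionTable:
    """Toy lookup table: block address -> vault that currently holds it."""

    def __init__(self, num_vaults, blocks_per_vault):
        self.blocks_per_vault = blocks_per_vault
        self.vaults = [Vault(i) for i in range(num_vaults)]
        self.location = {}   # only relocated blocks have an entry
        self.enabled = True  # adaptive switch for indirection

    def home_vault(self, block):
        # Assumption: blocks are mapped to vaults in contiguous ranges.
        return block // self.blocks_per_vault

    def lookup(self, block):
        """Return the vault a request for `block` should be sent to."""
        if self.enabled and block in self.location:
            return self.location[block]
        return self.home_vault(block)

    def relocate(self, block, requester):
        """Move `block` into the requester's reserved area if possible."""
        vault = self.vaults[requester]
        if not self.enabled or self.lookup(block) == requester:
            return False
        if len(vault.reserved) >= vault.reserved_capacity:
            return False
        previous = self.location.get(block)
        if previous is not None:
            self.vaults[previous].reserved.discard(block)
        vault.reserved.add(block)
        self.location[block] = requester
        return True

    def adapt(self, benefit_cycles, cost_cycles):
        """Keep indirection on only while its benefit exceeds its cost."""
        self.enabled = benefit_cycles > cost_cycles
        if not self.enabled:
            # Toy simplification: relocated blocks return to their home vaults.
            for vault in self.vaults:
                vault.reserved.clear()
            self.location.clear()
```

In this sketch, a block relocated to vault 0 is served from vault 0 until `adapt` observes that the cost outweighs the benefit, after which requests fall back to the block's home vault.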

Paper Structure

This paper contains 29 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Breakdown of Memory Latency into data transfer latency, queuing delay and array access latency with HMC memory.
  • Figure 2: Breakdown of Memory Latency with HBM.
  • Figure 3: Coefficient of variation (CoV) for memory request distribution across workloads with HMC memory.
  • Figure 4: CoV for memory request distribution across workloads with HBM.
  • Figure 5: Representation of an HMC system with 16 vaults in each layer. Each set of partitions is connected to its corresponding vault logic, illustrated with different colors in the image.
  • ...and 11 more figures
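
Figures 3 and 4 report the coefficient of variation (CoV) of the memory request distribution. As a reminder of what that metric measures, the sketch below computes CoV as standard deviation divided by mean over per-vault request counts. The input counts are made up for illustration, and the use of the population standard deviation is an assumption; the listing above does not say which variant the paper uses.

```python
from statistics import mean, pstdev


def coefficient_of_variation(requests_per_vault):
    """Return std / mean of the request counts (0.0 if there are no requests)."""
    mu = mean(requests_per_vault)
    if mu == 0:
        return 0.0
    # Assumption: population standard deviation over all vaults.
    return pstdev(requests_per_vault) / mu


# Hypothetical request counts for a four-vault system.
balanced = [100, 100, 100, 100]  # every vault receives the same load
skewed = [400, 0, 0, 0]          # one vault receives all requests

print(coefficient_of_variation(balanced))
print(coefficient_of_variation(skewed))
```

A CoV of zero means requests are spread evenly across vaults; larger values indicate that a few vaults absorb a disproportionate share of the traffic.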