PIM-CACHE: High-Efficiency Content-Aware Copy for Processing-In-Memory

Peterson Yuhala, Mpoki Mwaisela, Pascal Felber, Valerio Schiavoni

Abstract

Processing-in-memory (PIM) architectures bring computation closer to data, reducing the processor-memory transfer bottleneck of traditional processor-centric designs. Novel hardware solutions, such as UPMEM's in-memory processing technology, achieve this by integrating low-power DRAM processing units (DPUs) into memory DIMMs, enabling massive parallelism and improved memory bandwidth. Paradoxically, however, these PIM architectures introduce mandatory coarse-grained data transfers between host DRAM and DPUs, which often become the new bottleneck. We present PIM-CACHE, a lightweight data staging layer that dynamically eliminates redundant data transfers to PIM DPUs by exploiting workload similarity, achieving content-aware copy (CAC). We evaluate PIM-CACHE on both synthetic workloads and real-world genome datasets, demonstrating its effectiveness in reducing PIM data transfer overhead.
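
To make the idea of content-aware copy concrete, the following is a minimal host-side sketch, not the PIM-CACHE implementation. It assumes fixed-size chunks, SHA-256 digests as content fingerprints, and a plain dictionary as the lookup table; the function names `plan_content_aware_copy` and `rebuild` are hypothetical, and no UPMEM SDK call is used. The sketch only decides which chunks would need to be transferred and which can be replaced by a reference to a chunk that has already been staged.

```python
import hashlib


def plan_content_aware_copy(data: bytes, chunk_size: int, cache: dict):
    """Plan a transfer that skips chunks whose content was already staged.

    cache maps a chunk digest to the slot index where that content lives
    (an assumed layout, for illustration only). Returns:
      to_send: list of (slot, chunk) pairs that still have to be copied
      refs:    one slot index per chunk, enough to rebuild the buffer
    """
    to_send = []
    refs = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        slot = cache.get(digest)
        if slot is None:
            # First time this content is seen: assign a slot and copy it.
            slot = len(cache)
            cache[digest] = slot
            to_send.append((slot, chunk))
        refs.append(slot)
    return to_send, refs


def rebuild(refs, store):
    """Reassemble the original buffer from staged chunks and references."""
    return b"".join(store[slot] for slot in refs)


# Example: a highly redundant input, in the spirit of repeated genome reads.
data = b"ACGT" * 4 + b"TTTT" + b"ACGT"
cache = {}
to_send, refs = plan_content_aware_copy(data, chunk_size=4, cache=cache)
store = dict(to_send)  # stands in for the memory on the receiving side
sent_bytes = sum(len(chunk) for _, chunk in to_send)
assert rebuild(refs, store) == data
```

In this example only two of the six chunks are marked for transfer (8 of 24 bytes), and the remaining chunks are covered by references. The saving grows with the redundancy of the input, while hashing and lookups add host-side cost that has to be weighed against the transfer time avoided.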

Paper Structure (20 sections, 9 figures, 1 table)

Figures (9)

  • Figure 1: Total execution times for vector addition on two vectors of varying sizes, along with the total host-to-DPU copy time. Note: The copy time includes the total time required to transfer both input buffers of the corresponding size from host DRAM to DPU MRAM.
  • Figure 2: Architecture of a UPMEM-PIM enabled system.
  • Figure 3: Content-aware copy design.
  • Figure 4: Overhead of DRM operations with varying number of DRM threads, hash table, and data sizes.
  • Figure 5: Host to DPU data transfer overhead with CAC and without CAC (naive) using synthetic workloads with varying degrees of spatial redundancy. The least redundant workload is $R=0$ while the most redundant is $R=1$. As spatial redundancy increases (from left to right), the benefits of CAC become more apparent. We use 256 DPUs.
  • ...and 4 more figures