PIM-CACHE: High-Efficiency Content-Aware Copy for Processing-In-Memory

Peterson Yuhala; Mpoki Mwaisela; Pascal Felber; Valerio Schiavoni

Abstract

Processing-in-memory (PIM) architectures bring computation closer to data, reducing the processor-memory transfer bottleneck in traditional processor-centric designs. Novel hardware solutions, such as UPMEM's in-memory processing technology, achieve this by integrating low-power DRAM processing units (DPUs) into memory DIMMs, enabling massive parallelism and improved memory bandwidth. Paradoxically, however, these PIM architectures introduce mandatory coarse-grained data transfers between host DRAM and DPUs, which often become the new bottleneck. We present PIM-CACHE, a lightweight data staging layer that dynamically eliminates redundant data transfers to PIM DPUs by exploiting workload similarity, achieving content-aware copy (CAC). We evaluate PIM-CACHE on both synthetic workloads and real-world genome datasets, demonstrating its effectiveness in reducing PIM data transfer overhead.
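To make the idea of eliminating redundant transfers concrete, the following is a minimal host-side sketch of block-level deduplication before a copy. It is an illustration under stated assumptions, not the PIM-CACHE implementation: the function names `plan_copy` and `reconstruct`, the 4096-byte block size, and the use of SHA-256 as the fingerprint are all hypothetical choices, and no UPMEM SDK calls are involved.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def plan_copy(
    buffer: bytes,
    seen: Dict[bytes, int],
    block_size: int = BLOCK_SIZE,
) -> Tuple[List[bytes], List[int]]:
    """Split `buffer` into fixed-size blocks and plan a deduplicated transfer.

    `seen` maps a block fingerprint to the index of that block in the
    receiver-side store. Returns the blocks that still have to be sent
    and a recipe listing, per original block position, which stored
    block to use.
    """
    to_send: List[bytes] = []
    recipe: List[int] = []
    for offset in range(0, len(buffer), block_size):
        block = buffer[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).digest()  # assumed fingerprint
        index = seen.get(fingerprint)
        if index is None:
            # First time this content appears: it must be transferred.
            index = len(seen)
            seen[fingerprint] = index
            to_send.append(block)
        recipe.append(index)
    return to_send, recipe


def reconstruct(store: List[bytes], recipe: List[int]) -> bytes:
    """Rebuild the original buffer from stored blocks and a recipe."""
    return b"".join(store[i] for i in recipe)
```

In this sketch, a buffer made of three identical blocks followed by one distinct block yields only two blocks to send, plus a recipe of four indices from which the receiver can rebuild the full buffer.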

Paper Structure (20 sections, 9 figures, 1 table)

This paper contains 20 sections, 9 figures, 1 table.

Introduction
Background
Processing-in-memory
Deduplication and compression
Motivation
System design
Challenges
Content-aware copying to PIM DPUs
Implementation
Evaluation
Experimental setup
DRM processing overhead
Impact of deduplication on copy overhead
Effect of compression on data transfer overhead
Choice of block size and fingerprinting algorithm
...and 5 more sections

Figures (9)

Figure 1: Total execution times for vector addition on two vectors of varying sizes, along with the total host-to-DPU copy time. Note: The copy time includes the total time required to transfer both input buffers of the corresponding size from host DRAM to DPU MRAM.
Figure 2: Architecture of a UPMEM-PIM enabled system.
Figure 3: Content-aware copy design.
Figure 4: Overhead of DRM operations with varying number of DRM threads, hash table, and data sizes.
Figure 5: Host to DPU data transfer overhead with CAC and without CAC (naive) using synthetic workloads with varying degrees of spatial redundancy. The least redundant workload is $R=0$ while the most redundant is $R=1$. As spatial redundancy increases (from left to right), the benefits of CAC become more apparent. We use 256 DPUs.
...and 4 more figures
