A$^3$PIM: An Automated, Analytic and Accurate Processing-in-Memory Offloader

Qingcai Jiang, Shaojie Tan, Junshi Chen, Hong An

TL;DR

A$^3$PIM tackles the CPU-memory data-movement bottleneck with an automated, static-analysis-driven offloader that partitions code between CPU and PIM to minimize data movement. The approach combines a static code analyzer, a connectivity-based clustering scheme, and cluster-level intrinsic statistics to decide offloading without runtime profiling. On the GAP and PrIM benchmarks, it achieves average speedups of $2.63\times$ over CPU-only and $4.45\times$ over PIM-only execution, with basic-block granularity nearly reaching a theoretical upper bound of $4.56\times$ over PIM-only. This work demonstrates that static, data-movement-aware scheduling can unlock substantial PIM benefits in heterogeneous CPU-PIM systems, offering a practical path to efficient near-term offloading. Key contributions include a formal cost model, a two-stage clustering/offloading algorithm, and a comprehensive evaluation framework with real workloads.
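
To make the idea of data-movement-aware partitioning concrete, here is a minimal Python sketch. It is not the A$^3$PIM algorithm: the paper's cost model and clustering scheme are not reproduced in this summary, so the sketch simply brute-forces every CPU/PIM assignment for a tiny graph. The segment names, per-segment costs, and edge costs are all hypothetical values chosen for illustration.

```python
from itertools import product

# Hypothetical per-segment cost estimates (arbitrary time units).
# A real offloader would derive these from static analysis; these are made up.
SEGMENTS = {
    "bb0": {"cpu": 4.0, "pim": 9.0},   # compute-heavy: cheaper on CPU
    "bb1": {"cpu": 10.0, "pim": 3.0},  # memory-heavy: cheaper on PIM
    "bb2": {"cpu": 8.0, "pim": 2.5},   # memory-heavy: cheaper on PIM
    "bb3": {"cpu": 3.0, "pim": 7.0},   # compute-heavy: cheaper on CPU
}

# Hypothetical cross-segment data-movement cost, paid only when the two
# connected segments are placed on different devices.
EDGES = {
    ("bb0", "bb1"): 1.0,
    ("bb1", "bb2"): 6.0,
    ("bb2", "bb3"): 1.5,
}


def total_cost(placement):
    """Sum execution cost per segment plus movement cost on cut edges."""
    compute = sum(SEGMENTS[seg][dev] for seg, dev in placement.items())
    movement = sum(
        cost for (a, b), cost in EDGES.items() if placement[a] != placement[b]
    )
    return compute + movement


def best_placement():
    """Exhaustively try every CPU/PIM assignment (fine only for tiny graphs)."""
    names = list(SEGMENTS)
    candidates = (
        dict(zip(names, devices))
        for devices in product(("cpu", "pim"), repeat=len(names))
    )
    return min(candidates, key=total_cost)


if __name__ == "__main__":
    placement = best_placement()
    print("placement:", placement)
    print("mixed cost:", total_cost(placement))
    print("CPU-only cost:", total_cost({s: "cpu" for s in SEGMENTS}))
    print("PIM-only cost:", total_cost({s: "pim" for s in SEGMENTS}))
```

With these made-up numbers, the mixed placement keeps `bb1` and `bb2` together on PIM, so the expensive edge between them is never cut, and it beats both the CPU-only and PIM-only baselines. Exhaustive search grows exponentially with the number of segments, which is one reason a practical offloader would group tightly connected segments before deciding where to run them.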

Abstract

The performance gap between memory and processor has grown rapidly. Consequently, the energy and wall-clock time costs associated with moving data between the CPU and main memory dominate the overall computational cost. The Processing-in-Memory (PIM) paradigm emerges as a promising architecture that mitigates the need for extensive data movement by placing computing units close to the memory. Despite the abundant efforts devoted to building a robust and highly available PIM system, identifying PIM-friendly segments of applications remains challenging because no comprehensive tool exists to evaluate the intrinsic memory access pattern of a segment. To tackle this challenge, we propose A$^3$PIM: an Automated, Analytic and Accurate Processing-in-Memory offloader. We systematically consider the cross-segment data movement and the intrinsic memory access pattern of each code segment via a static code analyzer. We evaluate A$^3$PIM across a wide range of real-world workloads, including the GAP and PrIM benchmarks, and achieve average speedups of 2.63x and 4.45x (up to 7.14x and 10.64x) when compared to CPU-only and PIM-only executions, respectively.

Paper Structure

This paper contains 19 sections, 2 equations, 4 figures, 2 tables, and 1 algorithm.

Figures (4)

  • Figure 1: High-level PIM architecture
  • Figure 2: Example of (a) Cache Line Data Movement (CL-DM) and Corresponding (b) Context Switch Graph in Continuous ARM64 Assembly Code.
  • Figure 3: A$^3$PIM overview
  • Figure 4: Execution time breakdown of GAP and PrIM workloads using different offloading decisions