PyPIM: Integrating Digital Processing-in-Memory from Microarchitectural Design to Python Tensors

Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

TL;DR

PyPIM delivers an end-to-end, programmable stack for digital memristive PIM by tying a Python tensor API to a microarchitectural front end through a compact ISA and a flexible host driver. The approach enables seamless conversion of tensor-centric Python code into PIM-ready operations, leveraging partitioned crossbars, range-based masks, and inter-array communication to maximize parallelism. A GPU-accelerated, bit-accurate simulator validates correctness and demonstrates near-theoretical PIM throughput with modest driver overhead, while the development library and tensor-views abstractions simplify data alignment and inter-warp transfers. This work lowers the barrier to adopting PIM by providing familiar interfaces, portable abstractions, and an extensible software stack that can adapt to future digital PIM architectures. The practical impact is a more accessible, scalable path to high-throughput in-memory computing for data-intensive workloads.

Abstract

Digital processing-in-memory (PIM) architectures mitigate the memory wall problem by facilitating parallel bitwise operations directly within the memory. Recent works have demonstrated their algorithmic potential for accelerating data-intensive applications; however, there remains a significant gap in the programming model and microarchitectural design. This gap is further exacerbated by aspects unique to memristive PIM, such as partitions and operations across both directions of the memory array. To address this gap, this paper provides an end-to-end architectural integration of digital memristive PIM from a high-level Python library for tensor operations (similar to NumPy and PyTorch) to the low-level microarchitectural design. We begin by proposing an efficient microarchitecture and instruction set architecture (ISA) that bridge the gap between the low-level control periphery and an abstraction of PIM parallelism. We subsequently propose a PIM development library that converts high-level Python to ISA instructions and a PIM driver that translates ISA instructions into PIM micro-operations. We evaluate PyPIM via a cycle-accurate simulator on a wide variety of benchmarks that demonstrate both the versatility of the Python library and the performance relative to theoretical PIM bounds. Overall, PyPIM drastically simplifies the development of PIM applications and enables the conversion of existing tensor-oriented Python programs to PIM with ease.

Paper Structure (39 sections, 1 equation, 13 figures, 3 tables)

Figures (13)

  • Figure 1: (a) Majority logic [ComputeDRAM, SIMDRAM, DRISA, Ambit] within (b) all rows of a DRAM subarray. (c) Stateful logic [IMPLY, FELIX, MAGIC] between memristors within (d) all rows of a crossbar array. Both support (e), an abstract model enabling arbitrary bitwise operations on columns. The figure is adapted from AritPIM [AritPIM].
  • Figure 2: End-to-end integration from high-level Python to the proposed microarchitecture (arrows indicate runtime dependencies), thereby enabling the development and debugging of PIM applications. The Python library utilizes syntax similar to NumPy [NumPy] for vector arithmetic (e.g., $a * b + a$), read/write operations (e.g., $x[4] = 8.0$), indexing (e.g., $z[::2]$ selects all even indices), and general-purpose routines (e.g., $.sum()$ for aggregation).
  • Figure 3: (a) Stateful logic [MemristiveLogic] in the resistance domain between three memristors. (b) Parallel stateful logic in a crossbar array by applying $V_1$ and $V_2$ across bitlines while skipping a row, e.g., using $V_{iso}$.
  • Figure 4: (a) Bit-serial element-parallel arithmetic constructs vectored arithmetic from a serial sequence of logic gates that is performed in parallel across all rows (one gate per row in every cycle). Conversely, (b) the bit-parallel element-parallel approach stores the vectors in a strided format across $N$ partitions (each bit position in a different partition) and performs up to $N$ gates per row per cycle. Figure adapted from AritPIM [AritPIM].
  • Figure 5: An overview of the different proposed micro-operation types.
  • ...and 8 more figures
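
The bit-serial element-parallel idea in the Figure 4(a) caption can be illustrated with a minimal NumPy sketch. It is not the PyPIM library, its ISA, or the AritPIM algorithm. The assumptions are: the crossbar is modelled as a boolean array with one vector element per row and one bit per column, NOR is the only available gate, each call to the hypothetical helper `nor` stands for one cycle applied to all rows at once, and the adder is the textbook nine-NOR full adder. All function names are illustrative.

```python
import numpy as np

def nor(a, b):
    """One column-wise NOR gate, applied to every row in parallel (assumed: one cycle)."""
    return ~(a | b)

def full_adder_nor(a, b, c):
    """Textbook 1-bit full adder from nine NOR gates; returns (sum, carry_out)."""
    n1 = nor(a, b)
    n2 = nor(a, n1)
    n3 = nor(b, n1)
    n4 = nor(n2, n3)      # XNOR(a, b)
    n5 = nor(n4, c)
    n6 = nor(n4, n5)
    n7 = nor(c, n5)
    s = nor(n6, n7)       # a XOR b XOR c
    cout = nor(n1, n5)    # majority(a, b, c)
    return s, cout

def to_columns(values, n_bits):
    """Store one unsigned integer per row as boolean bit columns, LSB first."""
    values = np.asarray(values, dtype=np.int64)
    return ((values[:, None] >> np.arange(n_bits)) & 1).astype(bool)

def from_columns(bits):
    """Read each row back as an unsigned integer."""
    weights = 1 << np.arange(bits.shape[1])
    return (bits.astype(np.int64) * weights).sum(axis=1)

def bit_serial_add(a_bits, b_bits):
    """Add two vectors bit by bit; every gate acts on all rows simultaneously."""
    n_rows, n_bits = a_bits.shape
    carry = np.zeros(n_rows, dtype=bool)
    out = np.zeros((n_rows, n_bits), dtype=bool)
    gates = 0
    for i in range(n_bits):  # serial over bit positions, parallel over rows
        out[:, i], carry = full_adder_nor(a_bits[:, i], b_bits[:, i], carry)
        gates += 9
    return out, gates

n_bits = 8
a = [3, 200, 255, 0]
b = [5, 100, 1, 0]
sum_bits, gates = bit_serial_add(to_columns(a, n_bits), to_columns(b, n_bits))
result = from_columns(sum_bits)  # addition modulo 2**n_bits
```

Under these assumptions the gate count depends only on the bit width (nine gates per bit position) and not on the number of rows, which is the source of the element-level parallelism. The count ignores any initialization cycles that a real stateful-logic implementation may need.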