PUMA: Efficient and Low-Cost Memory Allocation and Alignment Support for Processing-Using-Memory Architectures

Geraldo F. Oliveira; Emanuele G. Esposito; Juan Gómez-Luna; Onur Mutlu

TL;DR

The paper tackles the data layout and alignment constraints of Processing-Using-DRAM architectures, where operands must be co-located within a DRAM subarray and row-aligned, a requirement that standard memory allocators do not meet. It introduces PUMA, a kernel-based lazy allocator that uses DRAM topology, interleaving, and a huge-pages pool to split pages into fine-grained, subarray-local regions, accessible via pim_preallocate, pim_alloc, and pim_alloc_align. Evaluation on a QEMU-based RISC-V/Fedora platform demonstrates substantial performance improvements over malloc across multiple microbenchmarks and allocation sizes, with larger gains for bigger allocations. This work presents a hardware-free memory allocator that enables efficient utilization of Processing-Using-Memory substrates by aligning OS memory allocation with DRAM structure, potentially reducing data movement and enabling more operations to run in the memory substrate.
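The allocation flow summarized above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the PUMA kernel module: the row size, rows per subarray, huge-page size, the linear address-to-subarray mapping, and the Python signatures given to pim_preallocate, pim_alloc, and pim_alloc_align are all hypothetical, and each allocation returns a single row-sized unit for brevity.

```python
from collections import defaultdict

# Illustrative constants (assumptions, not values taken from the paper).
ROW_BYTES = 8192                    # assumed DRAM row size
ROWS_PER_SUBARRAY = 512             # assumed rows per subarray
HUGE_PAGE_BYTES = 2 * 1024 * 1024   # assumed huge-page size


def subarray_of(addr: int) -> int:
    """Toy linear mapping; real DRAM mappings interleave channels and banks."""
    return addr // (ROW_BYTES * ROWS_PER_SUBARRAY)


class ToyPumaPool:
    """User-space model of a huge-page pool split into row-sized units."""

    def __init__(self):
        # subarray id -> list of free, row-aligned addresses in that subarray
        self.free_rows = defaultdict(list)

    def pim_preallocate(self, huge_page_bases):
        """Split each huge page into row-sized units, grouped by subarray."""
        for base in huge_page_bases:
            for addr in range(base, base + HUGE_PAGE_BYTES, ROW_BYTES):
                self.free_rows[subarray_of(addr)].append(addr)

    def pim_alloc(self):
        """Take one row from the subarray with the most free rows."""
        candidates = [s for s, rows in self.free_rows.items() if rows]
        if not candidates:
            return None
        sid = max(candidates, key=lambda s: len(self.free_rows[s]))
        return self.free_rows[sid].pop()

    def pim_alloc_align(self, hint_addr: int):
        """Take one row from the same subarray as an earlier allocation."""
        rows = self.free_rows.get(subarray_of(hint_addr))
        if not rows:
            return None  # no space left next to the hint
        return rows.pop()
```

In this model, a source operand obtained from pim_alloc and a destination obtained from pim_alloc_align with the source as the hint always share a subarray and are row-aligned, which is the property the constraints above require.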

Abstract

Processing-using-DRAM (PUD) architectures impose a restrictive data layout and alignment on their operands: source and destination operands (i) must reside in the same DRAM subarray (i.e., a group of DRAM rows sharing the same row buffer and row decoder) and (ii) must be aligned to the boundaries of a DRAM row. However, standard memory allocation routines (i.e., malloc, posix_memalign, and huge pages-based memory allocation) fail to meet the data layout and alignment requirements that PUD architectures need to operate successfully. To allow the memory allocation API to influence the OS memory allocator and ensure that memory objects are placed within specific DRAM subarrays, we propose PUMA, a new lazy data allocation routine (in the kernel) for PUD memory objects. The key idea of PUMA is to use the internal DRAM mapping information together with huge pages and then split huge pages into finer-grained allocation units that are (i) aligned to the page address and size and (ii) virtually contiguous. We implement PUMA as a kernel module using QEMU, emulating a RISC-V machine running Fedora 33 with Linux kernel v5.9.0. We emulate a PUD system capable of executing row copy operations (as in RowClone) and Boolean AND/OR/NOT operations (as in Ambit). In our experiments, if an operation cannot be executed in our PUD substrate (due to data misalignment), it is performed on the host CPU instead. PUMA significantly outperforms the baseline memory allocators for all evaluated microbenchmarks and allocation sizes.
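The two operand requirements and the CPU fallback described above can be made concrete with a short check. The following is a minimal sketch, assuming an 8 KiB row, 512 rows per subarray, and a simplified linear address-to-subarray mapping; these constants and all function names are illustrative assumptions and do not come from the paper.

```python
# Illustrative constants (assumptions, not values taken from the paper).
ROW_BYTES = 8192           # assumed DRAM row size
ROWS_PER_SUBARRAY = 512    # assumed rows per subarray


def subarray_of(phys_addr: int) -> int:
    """Toy linear mapping: consecutive rows fill one subarray, then the next.
    Real mappings interleave addresses across channels, ranks, and banks."""
    return phys_addr // (ROW_BYTES * ROWS_PER_SUBARRAY)


def is_row_aligned(phys_addr: int) -> bool:
    """Requirement (ii): the operand starts on a DRAM row boundary."""
    return phys_addr % ROW_BYTES == 0


def can_run_in_pud(*operands: int) -> bool:
    """True only if every operand is row-aligned and all share one subarray."""
    same_subarray = len({subarray_of(a) for a in operands}) == 1
    return same_subarray and all(is_row_aligned(a) for a in operands)


def dispatch(src_a: int, src_b: int, dst: int) -> str:
    """Choose where a bulk operation runs: in the PUD substrate when the
    layout allows it, otherwise on the host CPU (the fallback case)."""
    return "pud" if can_run_in_pud(src_a, src_b, dst) else "cpu"
```

For example, under these assumptions, operands at addresses 0, 8192, and 16384 qualify for in-DRAM execution, whereas shifting any one of them by 64 bytes, or moving it into the next subarray, forces the CPU path.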
