Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures
Johnathan Alsop, Shaizeen Aga, Mohamed Ibrahim, Mahzabeen Islam, Andrew Mccrabb, Nuwan Jayasena
TL;DR
This paper investigates whether commercially available PIM designs, optimized for ML, can broadly accelerate primitives across domains. It introduces the PIM-amenability-test to guide data placement and compute offload, then maps wave simulation, sparse skinny GEMMs, and push-based graph computations to a strawman PIM, revealing bottlenecks such as row-activation overheads and cache-reuse limitations. To address these bottlenecks, the authors propose architecture-aware, sparsity-aware, and cache-aware hardware-software optimizations, along with limit studies, which improve average PIM speedups from 1.12x to 2.49x relative to a GPU baseline. The work argues for inclusive PIM designs and hardware-software co-design to realize broad, memory-bandwidth-driven acceleration across ML and non-ML workloads, with practical implications for next-generation memory-centric architectures.
Abstract
Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in or near memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) machine learning (ML) primitives. While such proposals are reasonable given the growing importance of ML, memory is a pervasive component, so there is a case for a more inclusive PIM design that can accelerate primitives across domains. In this work, we ascertain the capabilities of commercial PIM proposals to accelerate various primitives across domains. We begin by outlining a set of characteristics, termed the PIM-amenability-test, which helps assess whether a given primitive is likely to be accelerated by PIM. Next, we apply this test to the primitives under study to determine efficient data placement and orchestration for mapping them to the underlying PIM architecture. We observe that, even though the primitives under study are largely PIM-amenable, existing commercial PIM proposals do not realize their performance potential for these primitives. To address this, we identify bottlenecks that arise in PIM execution and propose hardware and software optimizations that stand to broaden the acceleration reach of commercial PIM designs (improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we believe emerging commercial PIM proposals add a necessary and complementary design point in the application acceleration space, hardware-software co-design is necessary to deliver their benefits broadly.
