No One-Size-Fits-All: A Workload-Driven Characterization of Bit-Parallel vs. Bit-Serial Data Layouts for Processing-using-Memory
Jingyao Zhang, Elaheh Sadredini
TL;DR
Processing-in-Memory (PIM) targets the memory-wall bottleneck, but this paper shows that its two data layouts, Bit-Parallel (BP) and Bit-Serial (BS), are not interchangeable. It builds iso-area, cycle-accurate models of BP and BS architectures and evaluates them on a two-tier benchmark suite. The results reveal workload-dependent performance, with up to a 14× variation, and a hybrid approach that delivers up to a 2.66× speedup over the best static layout. The paper introduces a workload taxonomy to guide architects in selecting BP, BS, or hybrid schemes, and advocates low-cost, fast transpose hardware combined with compilers that partition code by layout phase. Together, the results argue for workload-aware, adaptable PIM systems rather than fixed data-layout choices.
Abstract
Processing-in-Memory (PIM) is a promising approach to overcoming the memory-wall bottleneck. However, the PIM community has largely treated its two fundamental data layouts, Bit-Parallel (BP) and Bit-Serial (BS), as if they were interchangeable. This implicit "one-layout-fits-all" assumption, often hard-coded into existing evaluation frameworks, creates a critical gap: architects lack systematic, workload-driven guidelines for choosing the optimal data layout for their target applications. To address this gap, this paper presents the first systematic, workload-driven characterization of BP and BS PIM architectures. We develop iso-area, cycle-accurate BP and BS PIM architectural models and conduct a comprehensive evaluation using a diverse set of benchmarks. Our suite includes both fine-grained microworkloads from MIMDRAM to isolate specific operational characteristics, and large-scale applications from the PIMBench suite, such as the VGG network, to represent realistic end-to-end workloads. Our results quantitatively demonstrate that no single layout is universally superior; the optimal choice is strongly dependent on workload characteristics. BP excels on control-flow-intensive tasks with irregular memory access patterns, whereas BS shows substantial advantages in massively parallel, low-precision (e.g., INT4/INT8) computations common in AI. Based on this characterization, we distill a set of actionable design guidelines for architects. This work challenges the prevailing one-size-fits-all view on PIM data layouts and provides a principled foundation for designing next-generation, workload-aware, and potentially hybrid PIM systems.
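To make the two layouts concrete, the following is a minimal Python sketch, not the paper's architectural model. It assumes a toy memory in which a BP layout stores each element as one whole word, while a BS layout stores the elements transposed, so that row b holds bit b of every element. The helper names (`to_bit_serial`, `from_bit_serial`, `bit_serial_add`) are hypothetical and chosen only for illustration. The sketch shows why a BS addition takes a number of steps equal to the bit width, regardless of how many elements are processed in lockstep.

```python
def to_bit_serial(values, width):
    """Transpose word-per-element data into bit planes (LSB first).

    Row b of the result holds bit b of every element.
    """
    return [[(v >> b) & 1 for v in values] for b in range(width)]


def from_bit_serial(planes):
    """Inverse transpose: rebuild one integer per column."""
    n = len(planes[0]) if planes else 0
    return [sum(planes[b][i] << b for b in range(len(planes))) for i in range(n)]


def bit_serial_add(a_planes, b_planes):
    """Ripple-carry addition, one bit plane per step.

    Each step applies the same bitwise logic to all columns at once,
    so the step count equals the bit width, not the element count.
    The result wraps modulo 2**width (the final carry is dropped).
    """
    carry = [0] * len(a_planes[0])
    out = []
    for a_row, b_row in zip(a_planes, b_planes):
        out.append([a ^ b ^ c for a, b, c in zip(a_row, b_row, carry)])
        carry = [(a & b) | (c & (a ^ b)) for a, b, c in zip(a_row, b_row, carry)]
    return out


WIDTH = 4  # INT4-style operands, as an illustrative assumption
a = [3, 5, 7, 15]
b = [1, 2, 8, 1]

sum_planes = bit_serial_add(to_bit_serial(a, WIDTH), to_bit_serial(b, WIDTH))
print(from_bit_serial(sum_planes))  # [4, 7, 15, 0]
print(len(sum_planes))              # 4 steps, one per bit plane
```

In this toy model, doubling the number of elements leaves the step count unchanged, while doubling the bit width doubles it. This is consistent with the abstract's observation that BS favors massively parallel, low-precision computation. The transposition performed by `to_bit_serial` is the kind of layout conversion a hybrid scheme would need between phases.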
