No One-Size-Fits-All: A Workload-Driven Characterization of Bit-Parallel vs. Bit-Serial Data Layouts for Processing-using-Memory

Jingyao Zhang, Elaheh Sadredini

TL;DR

The paper tackles the memory-wall bottleneck by showing that Bit-Parallel (BP) and Bit-Serial (BS) data layouts in Processing-in-Memory (PIM) are not interchangeable. It builds iso-area, cycle-accurate models for BP and BS and evaluates them across a two-tier benchmark suite, revealing workload-dependent performance that varies by up to 14×, with a hybrid path delivering up to a 2.66× speedup over the best static layout. A workload taxonomy guides architects in selecting BP, BS, or hybrid schemes, and the work advocates low-cost, fast transpose hardware combined with compilers that partition code by layout phase. Together, the results argue for workload-aware, adaptable PIM systems rather than fixed data-layout choices.

Abstract

Processing-in-Memory (PIM) is a promising approach to overcoming the memory-wall bottleneck. However, the PIM community has largely treated its two fundamental data layouts, Bit-Parallel (BP) and Bit-Serial (BS), as if they were interchangeable. This implicit "one-layout-fits-all" assumption, often hard-coded into existing evaluation frameworks, creates a critical gap: architects lack systematic, workload-driven guidelines for choosing the optimal data layout for their target applications. To address this gap, this paper presents the first systematic, workload-driven characterization of BP and BS PIM architectures. We develop iso-area, cycle-accurate BP and BS PIM architectural models and conduct a comprehensive evaluation using a diverse set of benchmarks. Our suite includes both fine-grained microworkloads from MIMDRAM to isolate specific operational characteristics, and large-scale applications from the PIMBench suite, such as the VGG network, to represent realistic end-to-end workloads. Our results quantitatively demonstrate that no single layout is universally superior; the optimal choice is strongly dependent on workload characteristics. BP excels on control-flow-intensive tasks with irregular memory access patterns, whereas BS shows substantial advantages in massively parallel, low-precision (e.g., INT4/INT8) computations common in AI. Based on this characterization, we distill a set of actionable design guidelines for architects. This work challenges the prevailing one-size-fits-all view on PIM data layouts and provides a principled foundation for designing next-generation, workload-aware, and potentially hybrid PIM systems.
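To make the BP/BS distinction concrete, the following is a minimal software sketch, not the paper's cycle-accurate model. It assumes a BS layout stores bit i of every element in row i (one element per column), so an addition proceeds one bit plane per step across all columns at once. The function names `to_bit_serial`, `from_bit_serial`, and `bit_serial_add` are hypothetical and chosen for illustration only.

```python
def to_bit_serial(values, bits):
    """Transpose integers into bit planes: planes[i][j] is bit i of values[j]."""
    return [[(v >> i) & 1 for v in values] for i in range(bits)]


def from_bit_serial(planes):
    """Inverse transpose: rebuild one integer per column from the bit planes."""
    n = len(planes[0])
    return [sum(planes[i][j] << i for i in range(len(planes))) for j in range(n)]


def bit_serial_add(a_planes, b_planes):
    """Ripple-carry add, one bit plane per step, LSB first.

    All columns (elements) are updated in the same step, so the number of
    steps grows with the bit width rather than with the element count.
    Returns the sum planes (modulo 2**bits) and the per-column carry-out.
    """
    n = len(a_planes[0])
    carry = [0] * n
    out = []
    for a_row, b_row in zip(a_planes, b_planes):
        s = [a ^ b ^ c for a, b, c in zip(a_row, b_row, carry)]
        carry = [(a & b) | (c & (a ^ b)) for a, b, c in zip(a_row, b_row, carry)]
        out.append(s)
    return out, carry


a = to_bit_serial([3, 5, 7, 15], bits=4)
b = to_bit_serial([1, 2, 8, 1], bits=4)
sum_planes, carry_out = bit_serial_add(a, b)
print(from_bit_serial(sum_planes), carry_out)
```

In this toy model a 4-bit addition takes four plane steps whether there are four columns or four thousand, which illustrates why BS favors wide, low-precision parallelism. A BP layout would instead hold each element's bits side by side in a single row.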

Paper Structure

This paper contains 34 sections, 1 equation, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: In-SRAM bitline operations. Simultaneous activation of two wordlines accomplishes AND/NOR operations, while an extra NOR gate facilitates XOR operation.
  • Figure 2: The four hierarchical data layout schemes resulting from combining bit-level (BP, BS) and vector-level (EP, ES) organization. (a) EP-BP: Ideal for inter-vector operations on wide data. (b) EP-BS: Maximizes parallelism for inter-vector operations. (c) ES-BP: Efficiently buffers a single vector for intra-vector operations. (d) ES-BS: Prone to row overflow for all but the simplest intra-vector tasks.
  • Figure 3: Contrasting a 4-element vector addition in (a) Bit-Parallel (BP) and (b) Bit-Serial (BS) layouts. BP configures the array into four wide PEs, utilizing the full array width. BS uses only four 1-bit columns, leaving most of the hardware idle.
  • Figure 4: Physical data layout for a 4-tap FIR filter. (a) In the BP layout, all required variables—coefficients, state, intermediate products, and the final output—are stored in separate rows, comfortably fitting within the array. (b) The BS layout attempts to store all these variables vertically, causing a massive row overflow.
  • Figure 5: Contrasting two permutation mechanisms. Left: a physical shuffle requires explicit data movement between memory locations, incurring multiple read-write cycles. Right: a logical shuffle achieves the same permutation through zero-cost address remapping, native to BP layouts with Element-Serial organization.
  • ...and 3 more figures
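The bitline operations named in the Figure 1 caption can be modeled in software as below. This is a hedged logical sketch only: it assumes each wordline is a list of 0/1 cells, that activating two wordlines yields column-wise AND and NOR, and that the extra NOR gate combines those two outputs, using the identity XOR(a, b) = NOR(AND(a, b), NOR(a, b)). The function names are hypothetical, and no circuit timing or sensing behavior is modeled.

```python
def bitline_and(row_a, row_b):
    """Column-wise AND of two activated wordlines."""
    return [a & b for a, b in zip(row_a, row_b)]


def bitline_nor(row_a, row_b):
    """Column-wise NOR of two activated wordlines."""
    return [1 - (a | b) for a, b in zip(row_a, row_b)]


def bitline_xor(row_a, row_b):
    """XOR built from the AND and NOR outputs with one extra NOR per column."""
    and_out = bitline_and(row_a, row_b)
    nor_out = bitline_nor(row_a, row_b)
    return [1 - (x | y) for x, y in zip(and_out, nor_out)]


# The four columns cover the full two-input truth table.
row_a = [0, 0, 1, 1]
row_b = [0, 1, 0, 1]
print(bitline_and(row_a, row_b))
print(bitline_nor(row_a, row_b))
print(bitline_xor(row_a, row_b))
```

The XOR output is 1 only in columns where neither AND nor NOR fires, that is, where the two stored bits differ.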