CHIME: Energy-Efficient STT-RAM-based Concurrent Hierarchical In-Memory Processing
Dhruv Gajaria, Tosiron Adegbija, Kevin Gomez
TL;DR
CHIME introduces a hierarchical in-memory processing framework that uses STT-RAM to distribute heterogeneous bit-line computing units across L1, L2, and main memory, mitigating data movement bottlenecks. By pipelining operations and strategically grouping and mapping compute units to memory levels, CHIME achieves large latency and energy gains over CPU-based and prior in-memory approaches: average improvements of around 81x in latency and 20x in energy compared to a CPU, and latency gains of 34–36x and energy gains of 4–20x relative to state-of-the-art STT-CiM variants. The design leverages relaxed-retention caches, enhanced bit-line compute units, and compiler-assisted scheduling to maximize concurrency while keeping overheads modest. Overall, CHIME demonstrates that hierarchical, domain-specific in-memory processing can substantially improve performance and energy efficiency across a wide range of workloads, offering a practical path toward scalable, data-centric computing.
Abstract
Processing-in-cache (PiC) and processing-in-memory (PiM) architectures, especially those utilizing bit-line computing, offer promising solutions to mitigate data movement bottlenecks within the memory hierarchy. While previous studies have explored the integration of compute units within individual memory levels, the complexity and potential overheads associated with these designs have often limited their capabilities. This paper introduces a novel PiC/PiM architecture, Concurrent Hierarchical In-Memory Processing (CHIME), which strategically incorporates heterogeneous compute units across multiple levels of the memory hierarchy. This design targets the efficient execution of diverse, domain-specific workloads by placing computations closest to the data where doing so optimizes performance, energy consumption, data movement costs, and area. CHIME employs STT-RAM for its advantages in PiC/PiM computing, such as high density, low leakage, and better resiliency to data corruption from activating multiple word lines. We demonstrate that CHIME enhances concurrency and improves compute unit utilization at each level of the memory hierarchy. We present strategies for exploring the design space and for grouping and placing the compute units across the memory hierarchy. Experiments reveal that, compared to state-of-the-art bit-line computing approaches, CHIME achieves significant speedup and energy savings of 57.95% and 78.23%, respectively, for various domain-specific workloads, while reducing the overheads associated with single-level compute designs.
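For readers unfamiliar with bit-line computing, the idea of activating multiple word lines can be illustrated with a toy model. The sketch below is not CHIME's circuit or its compute units. It assumes hypothetical normalized cell currents (`I_ZERO`, `I_ONE`) and reference thresholds (`REF_OR`, `REF_AND`), and shows only how sensing the combined current of two simultaneously activated rows can yield bitwise OR, AND, and XOR across a whole row at once.

```python
# Toy model of bit-line computing; an illustrative assumption, not CHIME's design.
# Hypothetical normalized currents: a cell storing 1 conducts I_ONE,
# a cell storing 0 conducts I_ZERO (the mapping of state to bit is a convention).
I_ZERO, I_ONE = 1.0, 2.0

# Reference currents placed between the possible two-cell sums (2.0, 3.0, 4.0).
REF_OR = 2.5   # exceeded when at least one of the two cells stores 1
REF_AND = 3.5  # exceeded only when both cells store 1


def bitline_compute(row_a, row_b):
    """Activate two word lines at once and sense every bit line in parallel."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same width")
    or_bits, and_bits, xor_bits = [], [], []
    for a, b in zip(row_a, row_b):
        # The bit line carries the sum of the two activated cells' currents.
        current = (I_ONE if a else I_ZERO) + (I_ONE if b else I_ZERO)
        or_bit = int(current > REF_OR)
        and_bit = int(current > REF_AND)
        or_bits.append(or_bit)
        and_bits.append(and_bit)
        xor_bits.append(or_bit & (1 - and_bit))  # XOR = OR and not AND
    return {"or": or_bits, "and": and_bits, "xor": xor_bits}


result = bitline_compute([1, 0, 1, 0], [1, 1, 0, 0])
```

Because both operands are read in a single access and the result is produced at the sense amplifiers, no operand has to travel to the CPU, which is the data movement saving the abstract refers to.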
