CHIME: Energy-Efficient STT-RAM-based Concurrent Hierarchical In-Memory Processing

Dhruv Gajaria, Tosiron Adegbija, Kevin Gomez

TL;DR

CHIME introduces a hierarchical in-memory processing framework that distributes heterogeneous bit-line computing units across L1, L2, and main memory, using STT-RAM to mitigate data movement bottlenecks. By pipelining operations and strategically grouping and mapping compute units to memory levels, CHIME achieves large latency and energy gains over CPU-based and prior in-memory approaches: average improvements of around 81x in latency and 20x in energy compared to a CPU, and gains of over 34–36x in latency and 4–20x in energy relative to state-of-the-art STT-CiM variants. The design leverages relaxed-retention caches, enhanced bit-line compute units, and compiler-assisted scheduling to maximize concurrency while keeping overheads modest. Overall, CHIME demonstrates that hierarchical, domain-specific in-memory processing can substantially improve performance and energy efficiency for a wide range of workloads, offering a practical path toward scalable, data-centric computing.

Abstract

Processing-in-cache (PiC) and Processing-in-memory (PiM) architectures, especially those utilizing bit-line computing, offer promising solutions to mitigate data movement bottlenecks within the memory hierarchy. While previous studies have explored the integration of compute units within individual memory levels, the complexity and potential overheads associated with these designs have often limited their capabilities. This paper introduces a novel PiC/PiM architecture, Concurrent Hierarchical In-Memory Processing (CHIME), which strategically incorporates heterogeneous compute units across multiple levels of the memory hierarchy. This design targets the efficient execution of diverse, domain-specific workloads by placing computations closest to the data where it optimizes performance, energy consumption, data movement costs, and area. CHIME employs STT-RAM due to its various advantages in PiC/PiM computing, such as high density, low leakage, and better resiliency to data corruption from activating multiple word lines. We demonstrate that CHIME enhances concurrency and improves compute unit utilization at each level of the memory hierarchy. We present strategies for exploring the design space, grouping, and placing the compute units across the memory hierarchy. Experiments reveal that, compared to the state-of-the-art bit-line computing approaches, CHIME achieves significant speedup and energy savings of 57.95% and 78.23% for various domain-specific workloads, while reducing the overheads associated with single-level compute designs.
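
To make the idea of bit-line computing concrete, the following minimal sketch models, in software only, the kind of bitwise results that can be derived when two word lines are activated together and the sense amplifiers observe the combined bit-line state. It illustrates the general technique and is not CHIME's circuit; the function name `bitline_ops` and the `width` parameter are hypothetical.

```python
def bitline_ops(row_a: int, row_b: int, width: int = 8) -> dict:
    """Software model of bitwise results from sensing two rows at once.

    Assumption: the sense amplifiers can distinguish "both cells set" (AND)
    from "at least one cell set" (OR); XOR is then derived from those two.
    """
    mask = (1 << width) - 1          # keep results within the row width
    a, b = row_a & mask, row_b & mask
    and_bits = a & b                 # both activated cells store 1
    or_bits = a | b                  # at least one activated cell stores 1
    xor_bits = or_bits & ~and_bits & mask  # exactly one cell stores 1
    return {"and": and_bits, "or": or_bits, "xor": xor_bits}


result = bitline_ops(0b1100, 0b1010, width=4)
print({name: format(bits, "04b") for name, bits in result.items()})
```

Each result is produced for every column of the row at once, which is why operating on whole rows in place avoids moving operands to the CPU.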

Paper Structure (22 sections, 5 equations, 7 figures, 3 tables, 1 algorithm)

This paper contains 22 sections, 5 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: A compute unit with logical, add, subtract, shift, compare, and multiply operations adds significant overhead to a traditional cache/memory.
  • Figure 2: High-level overview of the proposed hierarchical in-memory processing approach, illustrating the flow of data and computation between low latency reduced retention STT-RAM caches (PiC) and non-volatile STT-RAM main memory (PiM).
  • Figure 3: Illustrations of (a) modified sense amplifiers for bit-line computing, (b) STT-RAM cache subarray of MTJ cells with compute units after sense amplifiers, (c) components of a cache block with (d) a 2-bit cache monitor counter.
  • Figure 4: Program execution for hierarchical computing without and with pipelining.
  • Figure 5: Frequency of compute groups for each workload.
  • ...and 2 more figures
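
The contrast in the Figure 4 caption (execution without and with pipelining) can be approximated with a textbook pipeline latency model. This is a rough sketch under stated assumptions, not the paper's evaluation methodology: the stage latencies below are hypothetical placeholders for the L1, L2, and main-memory compute levels, and every operation is assumed to pass through all three stages in order.

```python
def sequential_latency(stage_latencies, num_ops):
    """Each operation finishes all stages before the next one starts."""
    return num_ops * sum(stage_latencies)


def pipelined_latency(stage_latencies, num_ops):
    """Stages overlap; the slowest stage sets the steady-state rate."""
    if num_ops == 0:
        return 0
    fill = sum(stage_latencies)                      # first operation
    steady = (num_ops - 1) * max(stage_latencies)    # remaining operations
    return fill + steady


# Hypothetical per-level latencies in arbitrary time units (not from the paper).
stages = [2, 4, 10]
ops = 8
print(sequential_latency(stages, ops), pipelined_latency(stages, ops))
```

Under this model the benefit of pipelining grows with the number of operations and is bounded by the slowest level, which is one reason balancing work across levels matters.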