STT-RAM-based Hierarchical In-Memory Computing

Dhruv Gajaria, Kevin Antony Gomez, Tosiron Adegbija

TL;DR

This work introduces Hierarchical In-Memory Computing (HiMC), combining relaxed-retention STT-RAM-based PiC at the cache level with non-volatile STT-RAM PiM in main memory to minimize data movement and energy. It shows that PiC can deliver substantial latency and energy benefits for CPU-dependent workloads, while PiM remains advantageous for CPU-independent workloads, and demonstrates that a heterogeneous two-retention cache design can further optimize overall performance. The authors develop an architectural framework including a retention-time monitor, operation chaining, and PiC/PiM management, and provide a thorough evaluation across eight workloads using validated simulation tools, yielding up to multi-fold speedups and significant area reductions compared to SRAM. The study highlights open research challenges in bit-line computing, compiler/hardware co-design, and data-flow-like architectures, setting a path for scalable, energy-efficient in-memory computing in resource-constrained systems.

Abstract

In-memory computing promises to overcome the von Neumann bottleneck in computer systems by performing computations directly within the memory. Previous research has suggested using Spin-Transfer Torque RAM (STT-RAM) for in-memory computing due to its non-volatility, low leakage power, high density, endurance, and commercial viability. This paper explores hierarchical in-memory computing, where different levels of the memory hierarchy are augmented with processing elements to optimize workload execution. The paper investigates processing in memory (PiM) using non-volatile STT-RAM and processing in cache (PiC) using volatile STT-RAM with relaxed retention, which helps mitigate STT-RAM's write latency and energy overheads. We analyze tradeoffs and overheads associated with data movement for PiC versus write overheads for PiM using STT-RAMs for various workloads. We examine workload characteristics, such as computational intensity and CPU-dependent workloads with limited instruction-level parallelism, and their impact on PiC/PiM tradeoffs. Using these workloads, we evaluate computing in STT-RAM versus SRAM at different cache hierarchy levels and explore the potential of heterogeneous STT-RAM cache architectures with various retention times for PiC and CPU-based computing. Our experiments reveal significant advantages of STT-RAM-based PiC over PiM for specific workloads. Finally, we describe open research problems in hierarchical in-memory computing architectures to further enhance this paradigm.

Paper Structure

This paper contains 27 sections, 2 equations, 12 figures, 2 tables, and 1 algorithm.

Figures (12)

  • Figure 1: STT-RAM cell structure. The high resistance state is anti-parallel, while the low resistance state is parallel.
  • Figure 2: The system model featuring processing in cache (PiC) implemented in the L1 and L2 caches and processing in memory (PiM) implemented in the main memory.
  • Figure 3: The sensing architecture used in our work (similar to prior work jain2017computing), which works for both relaxed-retention and non-volatile STT-RAM computing. (a) shows the sensed current for multiple word-lines and the reference signal position; (b) and (c) show the logical compute circuits.
  • Figure 4: (a) shows the high-level structure of a cache block; (b) illustrates the cache block monitor counter implemented using a finite state machine; (c) shows the subarray of a cache with the computational logic block after the sense amplifier; and (d) shows the computational logic.
  • Figure 5: The probability distribution function of the sensing current for PiC STT-RAM under 5% process variation for 10,000 samples.
  • ...and 7 more figures