Memory-Centric Computing: Recent Advances in Processing-in-DRAM

Onur Mutlu, Ataberk Olgun, Geraldo F. Oliveira, Ismail Emir Yuksel

TL;DR

Memory-centric computing aims to reduce data movement by performing computation in or near memory. The paper surveys Processing-in-DRAM approaches, including near-DRAM compute and DRAM-based analog computation, and highlights key architectures such as SIMDRAM, MIMDRAM, and Sectored DRAM. It reports that real commercial off-the-shelf (COTS) DRAM chips can execute bulk-bitwise operations, and even generate true random numbers (TRNG), using multiple-row activations, and that fine-grained DRAM designs can significantly improve efficiency. The results suggest substantial performance and energy benefits and argue for a paradigm shift in how hardware and software are designed around memory as a computation substrate.

Abstract

Memory-centric computing aims to enable computation capability in and near all places where data is generated and stored. As such, it can greatly reduce the large negative performance and energy impact of data access and data movement, by 1) fundamentally avoiding data movement, 2) reducing data access latency & energy, and 3) exploiting large parallelism of memory arrays. Many recent studies show that memory-centric computing can greatly improve system performance & energy efficiency. Major industrial vendors and startup companies have recently introduced memory chips with sophisticated computation capabilities. Going forward, both the hardware and the software stack should be revisited and designed carefully to take advantage of memory-centric computing. This work describes several major recent advances in memory-centric computing, specifically in Processing-in-DRAM, a paradigm where the operational characteristics of a DRAM chip are exploited and enhanced to perform computation on data stored in DRAM. Specifically, we describe 1) new techniques that slightly modify DRAM chips to enable both enhanced computation capability and easier programmability, 2) new experimental studies that demonstrate the functionally-complete bulk-bitwise computational capability of real commercial off-the-shelf DRAM chips, without any modifications to the DRAM chip or the interface, and 3) new DRAM designs that improve access granularity & efficiency, unleashing the true potential of Processing-in-DRAM.

Paper Structure (7 sections, 14 figures)

This paper contains 7 sections, 14 figures.

Figures (14)

  • Figure 1: An example of performing (a) the MAJority-of-three operation (i.e., MAJ3(A,B,C)) and (b) the NOT operation (i.e., dst=NOT(src)) in Ambit seshadri2017ambit. In (a), we focus on DRAM cell and sense amplifier operations. Initially, cells A, B, C, and the bitline have voltage levels of GND, VDD, VDD, and VDD/2, respectively. We first perform a triple-row activation (TRA) to simultaneously activate cells A, B, and C. When the wordlines of all three cells are raised simultaneously, charge sharing results in a positive deviation on the bitline because at least two of the cells are charged. Therefore, after sense amplification, the sense amplifier drives the bitline to VDD, which then fully charges all three cells. The final state of the bitline is, thus, the MAJority function of the charged state of the three cells A, B, and C. If one of the cells (say C) is set to GND (VDD), the final state would be the AND (OR) of the other two (A and B). To simplify the explanation, we assume no process variation and noise, but the Ambit paper and later works hajinazarsimdram, seshadri2017ambit evaluate these effects. Ambit-NOT (b) introduces the dual-contact cell (DCC), which is a DRAM cell with two transistors. In a DCC, one transistor, controlled by the data wordline, connects the cell capacitor to the bitline, and the other transistor, controlled by the negation wordline, connects the cell capacitor to the bitline-bar. Initially, src and dst cells each have a voltage level of VDD, and bitline and bitline-bar are precharged to VDD/2. To perform the NOT operation, we first activate the src cell. The activation drives the bitline to the value corresponding to the src, VDD in this case, and the bitline-bar to the negated value, i.e., GND. Second, Ambit activates the negation wordline. Doing so enables the transistor that connects the DCC to the bitline-bar. This results in the bitline-bar sharing its charge with the dst cell. Since the bitline-bar is already at a stable voltage level of GND, it overwrites the value in the DCC capacitor with GND, thereby copying the negated value of the src cell into the dst cell. The original Ambit paper (see Section 5 in seshadri2017ambit) proposes a DRAM subarray design that makes the implementation of triple-row activation low overhead by restricting TRA to an isolated set of DRAM rows that can be used for computation. It also describes circuit-level issues as well as the system and programming support needed for Ambit, and evaluates the hardware cost of modifications made to the DRAM chip and the memory controller. A later work seshadri2019dram describes some outstanding issues in Ambit-like bulk-bitwise Processing-using-DRAM substrates.
  • Figure 2: Overview of the SIMDRAM framework hajinazarsimdram. SIMDRAM consists of three key steps to enable a user-specified desired operation in DRAM: 1) building an efficient MAJ/NOT-based representation of the desired operation, 2) mapping the operation input and output operands to DRAM rows and to the required DRAM commands that produce the desired operation, and 3) executing the operation. The first two steps give users the flexibility to implement and compute any desired operation in DRAM efficiently. The goal of the first step is to use logic optimization to minimize the number of DRAM row activations and, thus, the computation latency required to perform a specific operation. Accordingly, the first step takes as input the AND/OR/NOT-based implementation of the desired operation and derives the operation's optimized MAJ/NOT-based implementation (i.e., the optimized majority inverter graph). The second step translates the optimized MAJ/NOT-based implementation into DRAM row activations, i.e., Ambit TRA seshadri2017ambit and RowClone seshadri2013rowclone operations. This step includes 1) mapping the operands to the designated rows in DRAM and 2) defining the sequence of DRAM row activations required to perform the computation associated with the optimized MAJ/NOT implementation. SIMDRAM chooses the operand-to-row mapping and the sequence of DRAM row activations to minimize the number of DRAM row activations required for a specific operation. The output of the second step is stored as a microprogram ($\mu$Program) in the memory controller, associated with the desired operation, bbop_new. The third step is to program the memory controller to issue the sequence of DRAM row activations to the appropriate rows in DRAM to perform the computation of the operation from start to end. When the user program encounters a SIMDRAM instruction (called bbop_new), the instruction is shipped to the memory controller, which invokes the associated $\mu$Program and executes the operation as specified by the $\mu$Program. To this end, SIMDRAM uses a control unit in the memory controller that transparently executes the sequence of DRAM row activations for each specific PuD operation executed by a user program. Once the $\mu$Program is complete, the result of the operation is held in DRAM. Figure adapted from hajinazarsimdram.
  • Figure 3: Overview of the DRAM subarray and bank organization of MIMDRAM oliveira2024mimdram. Green-colored boxes represent newly added or modified hardware components. To enable fine-grained PuD execution, MIMDRAM modifies Ambit's subarray and the DRAM bank with three new hardware structures: the mat isolation transistor, the row decoder latch, and the mat selector. At a high level, the mat isolation transistor allows for the independent access and operation of each DRAM mat within a subarray, while the row decoder latch enables the execution of a PuD operation in a range of DRAM mats that the mat selector defines. MIMDRAM implements an inter-mat interconnect to enable data movement across different mats by slightly modifying the connection between the I/O bus and the global sense amplifier. MIMDRAM adds a 2:1 multiplexer to each set of four 1-bit sense amplifiers in the global sense amplifier, selecting whether the data written to the sense amplifier set ($SA_{i}$) comes from the I/O bus or the neighboring sense amplifier set ($SA_{i-1}$). MIMDRAM enables data movement across columns within a DRAM mat through an intra-mat interconnect, which works by modifying the sequence of steps in the column access operation (hence without any hardware modification to the DRAM subarray structure). The intra-mat interconnect leverages the fact that 1) local bitlines in a mat already share an interconnection link via the helper flip-flops (HFFs) and 2) these HFFs can latch and amplify the local row buffer's data. Figure adapted from oliveira2024mimdram.
  • Figure 4: Intra-mat data movement in MIMDRAM oliveira2024mimdram. To enable data movement across columns within a DRAM mat, MIMDRAM implements an intra-mat interconnect (shown in Figure 3), which does not require any hardware modification. Instead, MIMDRAM modifies the sequence of steps DRAM executes during a column access to realize an intra-mat data movement operation. Two key observations enable the intra-mat interconnect. First, the local bitlines of a DRAM mat already share an interconnection path via the HFFs and column select logic (as this figure illustrates). Second, the HFFs in a DRAM mat can latch and amplify the local row buffer's data. To manage intra-mat data movement, MIMDRAM exposes a new DRAM command to the memory controller called LC-MOV (local I/O move). The LC-MOV command takes as input (i) the logical mat range [mat begin, mat end] of the target row, (ii) the row and column addresses of the source DRAM row and column, and (iii) the row and column addresses of the destination DRAM row and column. With the intra-mat interconnect and the new DRAM command, MIMDRAM can move four bits of data from a source row and column (row$_{src}$, column$_{src}$) to a destination row and column (row$_{dst}$, column$_{dst}$) in the same mat ($mat_{M}$). An LC-MOV command is transparently generated by the MIMDRAM control unit based on the source and destination mat addresses in a bbop_mov instruction (which the MIMDRAM compiler generates; see oliveira2024mimdram): if the source and destination mats' addresses are the same, the MIMDRAM control unit translates the data movement instruction into an LC-MOV command; otherwise, it translates the instruction into a GB-MOV command (global I/O move; see oliveira2024mimdram and Figure 5). Once the memory controller receives an LC-MOV command, it performs two steps. In the first step, the memory controller performs an ACT--RD--PRE targeting row$_{src}$, column$_{src}$ in $mat_{M}$. The ACT loads row$_{src}$ to $mat_{M}$'s local sense amplifier. The RD moves four bits from row$_{src}$, as indexed by column$_{src}$, into the mat's helper flip-flops (HFFs) by enabling the appropriate transistors in the column select logic. The HFFs are then enabled by transitioning the HFF enable signal from low to high. This allows the HFFs to latch and amplify the selected four-bit data column from the local sense amplifier. The PRE closes row$_{src}$. Up to this point, the LC-MOV command operates exactly as a regular ACT--RD--PRE command sequence. However, unlike a regular ACT--RD--PRE, the LC-MOV command does not lower the HFF enable signal when the RD finishes. This allows the four-bit data from column$_{src}$ to reside in the mat's HFFs. In the second step, the memory controller performs an ACT--WR--PRE targeting row$_{dst}$, column$_{dst}$ in $mat_{M}$. The ACT loads row$_{dst}$ into the mat's local row buffer, and the WR asserts the column select logic to column$_{dst}$, creating a path between the HFFs and the local row buffer. Since the HFF enable signal is kept high, the HFFs do not sense and latch the data from column$_{dst}$. Instead, the HFFs overwrite the data stored in the local sense amplifier with the four-bit data previously latched from column$_{src}$. The new data stored in the mat's local sense amplifier propagates through the local bitlines and is written to the destination DRAM cells. Figure adapted from oliveira2024mimdram.
  • Figure 5: An example of a PuD vector reduction, i.e., out+=(A[i]+B[i]), in MIMDRAM oliveira2024mimdram. For illustration purposes, we assume that DRAM has only two mats, and the 10 1-bit data elements of the input arrays A and B are evenly distributed across the two DRAM mats. MIMDRAM executes a vector reduction in three steps. In the first step, MIMDRAM executes a PuD addition operation over the data in the two DRAM mats, storing the temporary output data C into the same mats where the computation takes place (i.e., C = {C[9:5]$_{mat1}$, C[4:0]$_{mat0}$}). In the second step, MIMDRAM issues a GB-MOV (global I/O move; a new DRAM command to perform inter-mat data movement) to move part of the temporary output C[4:0] stored in $mat_{0}$ to a temporary row tmp in $mat_{1}$ (tmp[9:5]$_{mat1}$ $\leftarrow$ C[4:0]$_{mat0}$) via the inter-mat interconnect, four bits, i.e., four data elements, at a time, which corresponds to the size of the helper flip-flops (not shown in the figure). MIMDRAM iteratively executes step 2 until all data elements of C[4:0] are copied to $mat_{1}$. In the third step, once the GB-MOV finishes, MIMDRAM executes the final addition operation, i.e., tmp[9:5] + C[9:5], in $mat_{1}$. The final output of the vector reduction operation is stored in the destination row out in $mat_{1}$. Figure adapted from oliveira2024mimdram.
  • ...and 9 more figures
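
The logical effect of the operations in the Figure 1 caption can be sketched in a few lines of Python. This is a purely functional model under stated assumptions: each DRAM row is represented as a Python integer whose bits stand for the cells on that row, and charge sharing, process variation, and noise are ignored (as in the caption). The function names are hypothetical and are not part of Ambit.

```python
def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows (each row modeled as an int)."""
    return (a & b) | (b & c) | (a & c)


def triple_row_activate(rows: list, i: int, j: int, k: int) -> int:
    """Model a TRA: the sense amplifier resolves to MAJ3 and then
    drives that value back into all three activated rows."""
    result = maj3(rows[i], rows[j], rows[k])
    rows[i] = rows[j] = rows[k] = result
    return result


def row_and(a: int, b: int) -> int:
    """AND via MAJ3 with the third row preset to all zeros (GND)."""
    return maj3(a, b, 0)


def row_or(a: int, b: int, width: int) -> int:
    """OR via MAJ3 with the third row preset to all ones (VDD)."""
    return maj3(a, b, (1 << width) - 1)


def row_not(src: int, width: int) -> int:
    """Logical result of the dual-contact-cell NOT: dst = NOT(src)."""
    return ~src & ((1 << width) - 1)


# Caption example: A = GND (0), B = VDD (1), C = VDD (1).
rows = [0, 1, 1]
print(triple_row_activate(rows, 0, 1, 2), rows)
```

Note that the TRA overwrites its three source rows, which is one reason the caption mentions restricting TRA to a designated set of compute rows.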
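
To illustrate what a MAJ/NOT-based implementation (the first step in the Figure 2 caption) can look like, the sketch below builds an adder from majority and inversion only. It is not SIMDRAM's generated $\mu$Program: the full-adder identity used here is a standard majority-inverter construction, and the bit-serial layout (row i holds bit i of every element), the helper names, and the use of Python integers as rows are assumptions made for illustration.

```python
def maj(a: int, b: int, c: int) -> int:
    """Bitwise majority across lanes (one lane per element)."""
    return (a & b) | (b & c) | (a & c)


def inv(a: int, lanes: int) -> int:
    """Bitwise NOT restricted to the given number of lanes."""
    return ~a & ((1 << lanes) - 1)


def to_rows(values: list, bits: int) -> list:
    """Transpose elements into rows: row i holds bit i of every element."""
    return [sum(((v >> i) & 1) << lane for lane, v in enumerate(values))
            for i in range(bits)]


def from_rows(rows: list, lanes: int) -> list:
    """Inverse of to_rows: rebuild one integer per lane."""
    return [sum(((row >> lane) & 1) << i for i, row in enumerate(rows))
            for lane in range(lanes)]


def bit_serial_add(a_rows: list, b_rows: list, lanes: int) -> list:
    """Add all lanes in parallel, one bit position per iteration,
    using only MAJ and NOT."""
    carry = 0
    out = []
    for a, b in zip(a_rows, b_rows):
        inner = maj(a, b, inv(carry, lanes))
        new_carry = maj(a, b, carry)            # carry-out = MAJ(a, b, cin)
        out.append(maj(inv(new_carry, lanes), carry, inner))  # sum bit
        carry = new_carry
    out.append(carry)                           # final carry is the top bit
    return out


A, B = [3, 5, 7, 1], [1, 2, 7, 6]
result_rows = bit_serial_add(to_rows(A, 3), to_rows(B, 3), lanes=4)
print(from_rows(result_rows, lanes=4))
```

Each loop iteration uses three majority operations and two inversions regardless of how many lanes there are, which is the kind of cost that the logic-optimization step described in the caption aims to minimize.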
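
The two-step LC-MOV sequence in the Figure 4 caption can be followed with a small behavioral model. This is a sketch, not MIMDRAM's implementation: the Mat class, its method names, and the representation of each column as one four-bit integer are hypothetical, and timing and analog behavior are ignored. The point where the HFF enable signal is lowered again is an assumption, since the caption does not state it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mat:
    """Toy model of one DRAM mat; cells[row][col] is a 4-bit column group."""
    cells: List[List[int]]
    row_buffer: Optional[List[int]] = None   # local sense amplifiers
    open_row: Optional[int] = None
    hff: Optional[int] = None                # helper flip-flops (4 bits)
    hff_enabled: bool = False

    def act(self, row: int) -> None:
        """ACT: load a row into the local row buffer."""
        self.open_row = row
        self.row_buffer = list(self.cells[row])

    def pre(self) -> None:
        """PRE: close the open row, keeping the row buffer contents in the cells."""
        self.cells[self.open_row] = list(self.row_buffer)
        self.open_row, self.row_buffer = None, None


def lc_mov(mat: Mat, src_row: int, src_col: int,
           dst_row: int, dst_col: int) -> None:
    # Step 1: ACT--RD--PRE on the source; HFF enable is NOT lowered after RD.
    mat.act(src_row)
    mat.hff, mat.hff_enabled = mat.row_buffer[src_col], True
    mat.pre()
    # Step 2: ACT--WR--PRE on the destination; the still-enabled HFFs
    # overwrite the selected column of the row buffer.
    mat.act(dst_row)
    assert mat.hff_enabled
    mat.row_buffer[dst_col] = mat.hff
    mat.pre()
    mat.hff_enabled = False  # assumption: enable is lowered once the move ends


def select_mov_command(src_mat: int, dst_mat: int) -> str:
    """Control-unit choice from the caption: same mat -> LC-MOV, else GB-MOV."""
    return "LC-MOV" if src_mat == dst_mat else "GB-MOV"


mat = Mat(cells=[[0x1, 0x2, 0x3], [0x4, 0x5, 0x6]])
lc_mov(mat, src_row=0, src_col=2, dst_row=1, dst_col=0)
print(mat.cells)
```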
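
The three steps of the vector reduction in the Figure 5 caption can be traced with the sketch below. It assumes the caption's two-mat setup, models each data element as a plain Python integer rather than as bits in DRAM rows, and counts one GB-MOV per group of four elements (the helper flip-flop size given in the caption). The function name and return values are hypothetical.

```python
def two_mat_reduction(A: list, B: list, hff_width: int = 4):
    """Return (out row in mat1, number of GB-MOV commands issued)."""
    half = len(A) // 2

    # Step 1: each mat adds its own slice of A and B in place.
    c_mat0 = [a + b for a, b in zip(A[:half], B[:half])]
    c_mat1 = [a + b for a, b in zip(A[half:], B[half:])]

    # Step 2: move C from mat0 into a temporary row in mat1,
    # hff_width elements per GB-MOV.
    tmp, gb_movs = [], 0
    for start in range(0, half, hff_width):
        tmp.extend(c_mat0[start:start + hff_width])
        gb_movs += 1

    # Step 3: final addition inside mat1.
    out = [t + c for t, c in zip(tmp, c_mat1)]
    return out, gb_movs


A = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
B = [0, 1, 1, 0, 0, 1, 1, 0, 1, 0]
out, gb_movs = two_mat_reduction(A, B)
print(out, gb_movs)
```

With five elements per mat and four elements moved per command, step 2 runs twice, matching the caption's statement that step 2 is executed iteratively until all of C[4:0] has been copied.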