pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

João Dinis Ferreira, Gabriel Falcao, Juan Gómez-Luna, Mohammed Alser, Lois Orosa, Mohammad Sadrosadati, Jeremie S. Kim, Geraldo F. Oliveira, Taha Shahroodi, Anant Nori, Onur Mutlu

TL;DR

pLUTo introduces a general-purpose, LUT-based computation mechanism inside DRAM to overcome the narrow operation set of prior Processing-using-Memory (PuM) approaches. It implements a bulk LUT-query primitive within DRAM subarrays and offers three architectural variants (BSA, GSA, GMC) that trade flexibility and performance against additional DRAM area overhead (10.2–23.1%). The paper details a full system stack (ISA, library, compiler, and controller) and evaluates workloads spanning arithmetic, cryptography, image processing, and neural networks, comparing against CPU, GPU, FPGA, and previous PiM systems. On average, pLUTo outperforms optimized CPU and GPU baselines by 713$\times$ and 1.2$\times$, reduces energy consumption by 1855$\times$ and 39.5$\times$, and outperforms state-of-the-art PiM architectures by 18.3$\times$. The paper also discusses LUT data loading, scalability via subarray-level parallelism, and limitations, and makes a case for combining PiM substrates in real systems.

Abstract

Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high performance and energy efficiency, but existing PuM techniques support a limited range of operations. As a result, current PuM architectures cannot efficiently perform some complex operations (e.g., multiplication, division, exponentiation) without large increases in chip area and design complexity. To overcome these limitations of existing PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The key idea of pLUTo is to replace complex operations with low-cost, bulk memory reads (i.e., LUT queries) instead of relying on complex extra logic. We evaluate pLUTo across 11 real-world workloads that showcase the limitations of prior PuM approaches and show that our solution outperforms optimized CPU and GPU baselines by an average of 713$\times$ and 1.2$\times$, respectively, while simultaneously reducing energy consumption by an average of 1855$\times$ and 39.5$\times$. Across these workloads, pLUTo outperforms state-of-the-art PiM architectures by an average of 18.3$\times$. We also show that different versions of pLUTo provide different levels of flexibility and performance at different additional DRAM area overheads (between 10.2% and 23.1%). pLUTo's source code is openly and fully available at https://github.com/CMU-SAFARI/pLUTo.
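The key idea of replacing a complex operation with bulk LUT queries can be illustrated with a minimal software sketch. This is only a functional analogy in plain Python, not pLUTo's API or hardware behavior: the helper names (`build_lut`, `bulk_lut_query`, `mul_from_index`) are hypothetical, and forming the LUT index by concatenating two 4-bit operands is an assumption made for illustration.

```python
def build_lut(fn, index_bits):
    """Precompute fn for every possible index of the given bit width."""
    return [fn(i) for i in range(1 << index_bits)]


def bulk_lut_query(lut, indices):
    """Answer every query with a table read instead of a computation."""
    return [lut[i] for i in indices]


def mul_from_index(idx):
    """Assumed index layout: high 4 bits = operand a, low 4 bits = operand b."""
    a, b = idx >> 4, idx & 0xF
    return a * b


# 256-entry table covering all 4-bit x 4-bit products (hypothetical example).
MUL_LUT = build_lut(mul_from_index, 8)

operands = [(3, 5), (15, 15), (0, 9), (7, 2)]
indices = [(a << 4) | b for a, b in operands]  # concatenate the two operands
products = bulk_lut_query(MUL_LUT, indices)
print(products)  # [15, 225, 0, 14]
```

The table is built once and then reused for every query, so the per-element cost is a read rather than a multiplication. The sketch also shows the main constraint of the approach: the table size grows exponentially with the index width.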

Paper Structure

This paper contains 64 sections, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Internal organization of a DRAM module.
  • Figure 2: Main components of pLUTo.
  • Figure 3: A pLUTo LUT Query: (a) a LUT containing the first four prime numbers and an example user-specified LUT query, (b) setup of pLUTo's main components prior to the execution of the pLUTo LUT Query, and (c) steps of the pLUTo LUT Query. This pLUTo LUT Query returns into the destination row buffer (not depicted) the i-th prime number for each LUT index in the source row buffer.
  • Figure 4: The three pLUTo designs. m-c switch stands for matchline-controlled switch. Orange-dashed lines show how charge flows in case the matchline signal is asserted.
  • Figure 5: pLUTo's system integration stack. An example is shown for a C code snippet. Subsequent steps are shown in top-down, left-to-right order: implementation using pLUTo's API Library, the transformation of the pLUTo API code performed by the pLUTo Compiler, data dependency graph analysis, the role of the pLUTo Controller, and in-memory execution.
  • ...and 9 more figures
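
The LUT query described in the Figure 3 caption can be modeled in software as a sweep over the LUT rows with a per-element match. The sketch below is a hedged approximation based only on the captions above: the function name `lut_query_row_sweep`, the 0-based LUT indices, and the example source row are assumptions, and the inner loop stands in for comparisons that the hardware would perform across the whole row buffer at once.

```python
def lut_query_row_sweep(lut, source_row):
    """Software model of a LUT query: sweep LUT rows, copy on index match."""
    dest_row = [None] * len(source_row)
    for row_index, lut_value in enumerate(lut):  # one LUT row per sweep step
        for pos, src_index in enumerate(source_row):
            # Assumed behavior: the matchline for this position is asserted
            # when the source element equals the index of the swept LUT row.
            if src_index == row_index:
                dest_row[pos] = lut_value
    return dest_row


# LUT holding the first four prime numbers, indexed from 0 (assumption).
PRIMES_LUT = [2, 3, 5, 7]

source_row = [1, 0, 1, 3]  # hypothetical LUT indices in the source row buffer
result = lut_query_row_sweep(PRIMES_LUT, source_row)
print(result)  # [3, 2, 3, 7]
```

In this model the number of sweep steps depends on the number of LUT rows and not on the number of queried elements, which is one way to read the claim of massively parallel LUT querying.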