Towards Efficient LUT-based PIM: A Scalable and Low-Power Approach for Modern Workloads
Bahareh Khabbazan, Marc Riera, Antonio González
TL;DR
This work targets the data-movement energy bottleneck in memory-centric workloads with Lama, a lightweight LUT-based Processing-using-Memory (PuM) mechanism that enables independent column accesses within each DRAM mat. By exploiting mat-level parallelism and the open-page policy, Lama reduces the number of energy-intensive ACT commands. It supports up to 8-bit operand precision with a 2.47% area overhead and, for bulk 8-bit multiplication, achieves on average 8.5x higher performance than state-of-the-art PuM architectures and 3.8x higher performance than a CPU. To extend these gains to ML workloads, the authors present LamaAccel, an HBM-based accelerator that uses exponential quantization (DNA-TEQ) to turn dot-products into addition and counting tasks, so that accumulation can be performed efficiently in memory for attention-based models. Evaluations on bulk multiplication and large language models show up to 4.8x/9.8x speedup and 9.3x/19.2x energy reduction over TPU/GPU, and up to 2.1x speedup and 5.8x energy reduction over a state-of-the-art PuM baseline. The architecture scales through pseudo-channels and requires minimal DRAM timing changes. Together, Lama and LamaAccel offer a practical path to high-throughput, energy-efficient in-memory acceleration for modern workloads, including LLM inference.
Abstract
Data movement in memory-intensive workloads, such as deep learning, incurs energy costs that are over three orders of magnitude higher than the cost of computation. Since these workloads involve frequent data transfers between memory and processing units, addressing data movement overheads is crucial for improving performance. Processing-using-memory (PuM) offers an effective solution by enabling in-memory computation, thereby minimizing data transfers. In this paper, we propose Lama, a LUT-based PuM architecture designed to efficiently execute SIMD operations by supporting independent column accesses within each mat of a DRAM subarray. Lama exploits DRAM's mat-level parallelism and open-page policy to significantly reduce the number of energy-intensive memory activation (ACT) commands, which are the primary source of overhead in most PuM architectures. Unlike prior PuM solutions, Lama supports up to 8-bit operand precision without decomposing computations, while incurring only a 2.47% area overhead. Our evaluation shows that, for bulk 8-bit multiplication, Lama achieves an average performance improvement of 8.5x over state-of-the-art PuM architectures and 3.8x over a CPU, along with energy efficiency gains of 6.9x and 8x, respectively. We also introduce LamaAccel, an HBM-based PuM accelerator that utilizes Lama to accelerate the inference of attention-based models. LamaAccel employs exponential quantization to optimize product/accumulation in dot-product operations, transforming them into simpler tasks like addition and counting. LamaAccel delivers up to 9.3x/19.2x reduction in energy and 4.8x/9.8x speedup over TPU/GPU, along with up to 5.8x energy reduction and 2.1x speedup over a state-of-the-art PuM baseline.
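The claim that exponential quantization turns a dot-product into addition and counting can be illustrated with a small sketch. This is not LamaAccel's implementation or the exact DNA-TEQ scheme: it assumes a simplified quantizer in which every value is stored as a sign and an integer exponent of a shared base, with no scale or offset terms. The names BASE, EXP_MIN, EXP_MAX, quantize, and dot_by_counting are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical parameters for this sketch (not taken from the paper).
BASE = 2 ** 0.25           # shared exponential base
EXP_MIN, EXP_MAX = -16, 15  # clamp range for the integer exponents


def quantize(x):
    """Map x to (sign, exponent) so that x is approximately sign * BASE**exponent."""
    if x == 0:
        return 0, 0  # sign 0 marks an exact zero
    exponent = round(math.log(abs(x), BASE))
    exponent = max(EXP_MIN, min(EXP_MAX, exponent))
    return (1 if x > 0 else -1), exponent


def dequantize(q):
    """Recover the real value represented by a (sign, exponent) pair."""
    sign, exponent = q
    return sign * BASE ** exponent


def dot_by_counting(qa, qw):
    """Dot product of two quantized vectors using only additions and counts.

    Each product BASE**ea * BASE**ew equals BASE**(ea + ew), so a multiply
    becomes an exponent addition. Occurrences of each exponent sum are
    counted (signed), and the counts are weighted once at the end.
    """
    counts = Counter()
    for (sa, ea), (sw, ew) in zip(qa, qw):
        sign = sa * sw
        if sign != 0:
            counts[ea + ew] += sign  # add exponents, then count
    return sum(count * BASE ** exponent for exponent, count in counts.items())


def dot_reference(qa, qw):
    """Conventional multiply-accumulate over the dequantized values."""
    return sum(dequantize(a) * dequantize(w) for a, w in zip(qa, qw))


if __name__ == "__main__":
    activations = [0.5, -1.0, 2.0, 0.0, 0.3]
    weights = [1.0, 0.25, -0.5, 3.0, 0.3]
    qa = [quantize(x) for x in activations]
    qw = [quantize(x) for x in weights]
    print(dot_by_counting(qa, qw), dot_reference(qa, qw))
```

In this form the per-element work is an integer addition and a counter increment; the only multiplications left are the final weighting of one count per distinct exponent sum, whose number is bounded by the exponent range rather than by the vector length.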
