SAL-PIM: A Subarray-level Processing-in-Memory Architecture with LUT-based Linear Interpolation for Transformer-based Text Generation
Wontak Han, Hyunjun Cho, Donghyuk Kim, Joo-Young Kim
TL;DR
SAL-PIM introduces a subarray-level processing-in-memory architecture on HBM2 to accelerate end-to-end transformer-based text generation, addressing memory bandwidth limits in the generation stage and the need to compute nonlinear functions efficiently. It combines subarray-level ALUs (S-ALU), LUT-based linear interpolation via LUT-embedded subarrays, and channel-level accumulation (C-ALU) to enable full GPT decoder operations with reduced data movement. The authors validate the design with a cycle-accurate simulator and 28-nm CMOS implementations, achieving up to 4.72x speedup and an average of 1.83x over a server GPU for GPT-2 medium on realistic input/output sizes, while maintaining modest area overhead (~4.8%) and manageable power (~9% of the HBM2 budget). These results demonstrate the viability of end-to-end PIM for large-scale text generation and highlight directions for scaling to larger models and heterogeneous execution to further boost performance and energy efficiency.
Abstract
Text generation is a compelling sub-field of natural language processing that aims to generate human-readable text from input words. In particular, decoder-only generative models, such as the generative pre-trained transformer (GPT), are widely used for text generation and have two major computational stages: summarization and generation. Unlike the summarization stage, which can process the input tokens in parallel, the generation stage is difficult to accelerate because it produces output tokens sequentially, one per iteration. Moreover, each iteration requires reading the whole model with little opportunity for data reuse. Therefore, the workload of transformer-based text generation is severely memory-bound, making external memory bandwidth the system bottleneck. In this paper, we propose a subarray-level processing-in-memory architecture named SAL-PIM, an HBM-based PIM architecture for the end-to-end acceleration of transformer-based text generation. The SAL-PIM architecture includes three architectural features. First, it exploits the higher internal bandwidth of the memory by integrating multiple subarray-level arithmetic logic units with optimized data mapping schemes. Second, it adopts LUT-based linear interpolation to perform complex non-linear functions in PIM. Third, it accelerates end-to-end text generation inference on PIM. Furthermore, to validate the SAL-PIM architecture, we built a cycle-accurate simulator and implemented SAL-PIM's logic units in 28-nm CMOS technology. As a result, when the input size ranges from 32 to 128 and the output size ranges from 1 to 256, SAL-PIM achieves a maximum speedup of 4.72 times and an average speedup of 1.83 times over a server-level GPU for text generation based on the GPT-2 medium model.
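To make the second feature concrete, the following is a minimal software sketch of LUT-based linear interpolation applied to GELU, a non-linear function used in GPT-style decoders. It illustrates the general technique only and is not the SAL-PIM hardware design: the input range, the number of sections, the slope/intercept table layout, and the function names `build_lut` and `lut_interp` are all assumptions chosen for illustration.

```python
import math


def gelu(x):
    """Reference GELU, used only to fill the table and to check accuracy."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_lut(fn, lo, hi, sections):
    """Split [lo, hi] into equal sections and store (slope, intercept) per section.

    The table layout is an assumption for this sketch, not the paper's format.
    """
    step = (hi - lo) / sections
    table = []
    for i in range(sections):
        x0 = lo + i * step
        x1 = x0 + step
        slope = (fn(x1) - fn(x0)) / step
        intercept = fn(x0) - slope * x0
        table.append((slope, intercept))
    return table, step


def lut_interp(x, table, lo, step):
    """Approximate fn(x) with one table lookup, one multiply, and one add."""
    idx = int((x - lo) // step)
    # Inputs outside the table range reuse the nearest edge section.
    idx = max(0, min(len(table) - 1, idx))
    slope, intercept = table[idx]
    return slope * x + intercept


if __name__ == "__main__":
    LO, HI, SECTIONS = -8.0, 8.0, 64  # hypothetical parameters
    table, step = build_lut(gelu, LO, HI, SECTIONS)
    for x in (-3.0, -0.7, 0.0, 1.3, 4.0):
        print(x, gelu(x), lut_interp(x, table, LO, step))
```

The appeal of this approach for in-memory execution is that evaluating the function reduces to a table read followed by a multiply-add, so accuracy can be traded against table size by changing the number of sections.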
