MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs

Abhishek Moitra, Arkapravo Ghosh, Shrey Agarwal, Aporva Amarnath, Karthik Swaminathan, Priyadarshini Panda

TL;DR

MEADOW addresses the memory and bandwidth bottlenecks of large language model (LLM) inference on edge devices with two techniques: a token-parallel head-sequential (TPHS) dataflow for the Q, QK^T, SM, and SMxV layers, and a weight packing scheme that decomposes weight matrices into unique chunks with packet-specific precision and frequency-aware re-indexing. The architecture combines a GEMM path for the KV, Proj, and MLP layers with TPHS for the remaining layers, supported by a Weight Unpacking/Index Look-up unit and a NoC-enabled tiled PE array, and substantially reduces off-chip data transfers. On a sub-10W Xilinx ZCU102, MEADOW delivers up to about 2.5x lower prefill latency and 1.5x lower decode latency than a GEMM-based implementation, and over 40% end-to-end latency improvement relative to prior LLM optimizations, while also showing ViT latency benefits. By reducing memory traffic without sacrificing accuracy, the approach makes edge deployment of LLMs and related vision transformers more practical on low-power hardware.
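The loop order implied by the name "token-parallel head-sequential" can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the list-of-lists data layout, the 1/sqrt(d) scaling, and the omission of causal masking are all assumptions made for the example. The point it shows is that each head's Q, QK^T, SM, and SMxV steps run back to back for every token, so only the final per-head context vector is kept; the intermediate query, score, and probability values are never stored for later reuse.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tphs_attention(tokens, Wq_heads, K_heads, V_heads):
    """Hypothetical TPHS-style loop: heads are visited one at a time,
    and all tokens are processed for the current head (in hardware these
    would run in parallel; here they are simulated serially)."""
    outputs = [[] for _ in tokens]
    for Wq, K, V in zip(Wq_heads, K_heads, V_heads):   # head-sequential
        d = len(Wq)
        for t, x in enumerate(tokens):                 # token-parallel
            q = matvec(Wq, x)                                      # Q
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                      for k in K]                                  # QK^T
            probs = softmax(scores)                                # SM
            ctx = [sum(p * v[j] for p, v in zip(probs, V))
                   for j in range(len(V[0]))]                      # SMxV
            outputs[t].extend(ctx)   # only the head's result is kept
    return outputs
```

In a layer-by-layer GEMM execution, by contrast, the full Q matrix, the full score matrix, and the full softmax output would each be written out before the next layer starts, which is the repeated store-and-fetch traffic the abstract identifies as the bottleneck.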

Abstract

The computational and memory challenges of large language models (LLMs) have sparked several optimization approaches towards their efficient implementation. While prior work on LLM-targeted quantization and sparse acceleration has significantly mitigated the memory and computation bottlenecks, it assumes high-power platforms such as GPUs and server-class FPGAs with large off-chip memory bandwidths, and it employs a generalized matrix multiplication (GEMM) execution of all the layers in the decoder. In such a GEMM-based execution, data is fetched from off-chip memory, computed, and stored back. However, at reduced off-chip memory capacities, as is the case with low-power edge devices, this implementation strategy significantly increases the attention computation latency owing to the repeated storage and fetch of large intermediate tokens to and from the off-chip memory. Moreover, fetching the weight matrices from a bandwidth-constrained memory further aggravates the memory bottleneck. To this end, we introduce MEADOW, a framework that significantly reduces off-chip memory access for LLMs with a novel token-parallel head-sequential (TPHS) dataflow. Additionally, MEADOW applies weight packing, which performs a lossless decomposition of large weight matrices into their unique elements, thereby reducing the enormous weight fetch latency. MEADOW demonstrates 1.5x and 2.5x lower decode and prefill latency, respectively, compared to a GEMM-based LLM implementation on the low-power Xilinx ZCU102 FPGA platform, which consumes less than 10W. Additionally, MEADOW achieves an end-to-end latency improvement of over 40% compared to prior LLM optimization works.
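The idea of a lossless decomposition into unique elements can be illustrated with a small sketch. It assumes, for the sake of the example, that each weight row is split into fixed-size chunks, that every distinct chunk is stored once in a "unique matrix", and that the original matrix is replaced by a matrix of chunk indices. The function names `pack_weights` and `unpack_weights` and the `chunk_size` parameter are hypothetical and do not come from the paper.

```python
def pack_weights(matrix, chunk_size):
    """Split each row into fixed-size chunks and store each distinct chunk once.

    Returns (unique_chunks, index_rows): the list of distinct chunks and,
    for every row, the chunk indices needed to rebuild that row.
    """
    unique_chunks = []   # plays the role of the "unique matrix"
    chunk_ids = {}       # chunk contents -> position in unique_chunks
    index_rows = []
    for row in matrix:
        if len(row) % chunk_size != 0:
            raise ValueError("row length must be a multiple of chunk_size")
        ids = []
        for start in range(0, len(row), chunk_size):
            chunk = tuple(row[start:start + chunk_size])
            if chunk not in chunk_ids:
                chunk_ids[chunk] = len(unique_chunks)
                unique_chunks.append(chunk)
            ids.append(chunk_ids[chunk])
        index_rows.append(ids)
    return unique_chunks, index_rows

def unpack_weights(unique_chunks, index_rows):
    """Rebuild the original matrix exactly from chunks and indices."""
    return [[value for i in ids for value in unique_chunks[i]]
            for ids in index_rows]
```

For example, packing `[[1, 2, 1, 2], [3, 4, 1, 2]]` with `chunk_size=2` stores two unique chunks instead of four, and `unpack_weights` returns the original matrix unchanged. The scheme saves traffic only when chunks repeat often enough that the indices plus the unique chunks are smaller than the raw weights, which is more likely for quantized, low-precision weights.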

Paper Structure

This paper contains 18 sections, 1 equation, 13 figures, 2 tables.

Figures (13)

  • Figure 1: (a) Decoder architecture used in LLMs. (b) Prefill latency distribution across data fetch, store, and computation for the different layers in the decoder. (c) Decode latency distribution. During decode, compute and storage latency is negligible compared to the weight and input fetch latency. All latency results are based on an OPT-125M LLM implementation on the Xilinx ZCU102 FPGA with off-chip DRAM bandwidth = 12 Gbps.
  • Figure 2: (a) Tiled architecture of MEADOW containing parallel and broadcasting processing elements (PEs), pipelined softmax (SM) module, modules for layer normalization (LN) and non-linear activation functions like ReLU/GeLU (NL). (b) The hybrid PE architecture capable of operating in GEMM and pipelined modes. (c) Architecture and execution flow of a parallel and broadcasting MAC PE. (d) The pipelined softmax (SM) module.
  • Figure 3: (a) Example of the token-parallel head-sequential (TPHS) dataflow with two input tokens processed in parallel. (b) Pipelined execution of a transformer with 3 heads (H1-H3) and 4 input tokens (IP1-4).
  • Figure 4: (a) Process of generating the unique matrix, and trends in the reduction ratios for the OPT-125M and OPT-1.3B LLMs across different layers in the decoder. Reduction ratios are averaged across all the decoder layers. (b) Packet-specific encoding precision and (c) frequency-aware reindexing to further optimize the DRAM bandwidth.
  • Figure 5: (a) The WILU Module (b) The mode-aware unpacking (MAU) module.
  • ...and 8 more figures
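
The two refinements named in the Figure 4 caption, frequency-aware reindexing and packet-specific encoding precision, can be sketched under one plausible reading: the most frequent chunk indices are remapped to the smallest values, and each fixed-size packet of indices is then encoded with only as many bits as its largest index needs. This interpretation, the function names, and the `packet_size` parameter are assumptions for illustration; the sketch also ignores the per-packet header a real format would need to signal the chosen precision.

```python
from collections import Counter

def reindex_by_frequency(indices):
    """Remap indices so the most frequent original index becomes 0,
    the next most frequent becomes 1, and so on."""
    counts = Counter(indices)
    remap = {old: new for new, (old, _) in enumerate(counts.most_common())}
    return [remap[i] for i in indices], remap

def packet_bits(indices, packet_size):
    """Bits per index needed for each packet, set by its largest index."""
    bits = []
    for start in range(0, len(indices), packet_size):
        packet = indices[start:start + packet_size]
        bits.append(max(1, max(packet).bit_length()))
    return bits

def total_bits(indices, packet_size):
    """Total payload bits when each packet uses its own precision."""
    sizes = [min(packet_size, len(indices) - s)
             for s in range(0, len(indices), packet_size)]
    return sum(n * b for n, b in zip(sizes, packet_bits(indices, packet_size)))
```

With the index stream `[7, 7, 7, 7, 3, 7, 7, 5]` and `packet_size=4`, both packets need 3 bits per index before reindexing. After reindexing, the frequent index 7 becomes 0, so the two packets need 1 and 2 bits per index respectively, halving the payload from 24 to 12 bits.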