MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs
Abhishek Moitra, Arkapravo Ghosh, Shrey Agarwal, Aporva Amarnath, Karthik Swaminathan, Priyadarshini Panda
TL;DR
MEADOW addresses the memory and bandwidth bottlenecks of large language model (LLM) inference on edge devices by introducing a token-parallel head-sequential (TPHS) dataflow for the Q, QK^T, SM, and SMxV layers and a weight packing scheme that decomposes weight matrices into unique chunks with packet-specific precision and frequency-aware re-indexing. The architecture combines a GEMM path for the KV, Proj, and MLP layers with TPHS for the remaining layers, supported by a Weight Unpacking/Index Look-up unit and a NoC-enabled tiled PE array, and substantially reduces off-chip data transfers. On a sub-10W Xilinx ZCU102, MEADOW delivers up to about 2.5x lower prefill latency and 1.5x lower decode latency than a GEMM-based implementation, and an end-to-end latency improvement of over 40% relative to prior LLM optimizations, while also showing latency benefits for vision transformers (ViTs). Because the weight decomposition is lossless, the reduction in memory traffic does not come at the cost of accuracy, which makes low-power edge deployment of LLMs and vision transformers more practical.
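The weight packing idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed `chunk_size`, the `packet_size` grouping, and the rule used to pick a per-packet index width are all assumptions made for illustration. The sketch shows only the general pattern of storing each unique chunk once, giving the most frequent chunks the smallest IDs, and recovering the original weights exactly by look-up.

```python
from collections import Counter

def pack_weights(weights, chunk_size):
    """Decompose a flat list of quantized weights into unique chunks plus indices."""
    # Split the flat weight list into fixed-size chunks (tuples are hashable).
    chunks = [tuple(weights[i:i + chunk_size])
              for i in range(0, len(weights), chunk_size)]
    # Frequency-aware re-indexing (assumed rule): the most frequent chunk
    # gets ID 0, the next most frequent gets ID 1, and so on.
    counts = Counter(chunks)
    unique_chunks = [chunk for chunk, _ in counts.most_common()]
    chunk_id = {chunk: i for i, chunk in enumerate(unique_chunks)}
    indices = [chunk_id[chunk] for chunk in chunks]
    return unique_chunks, indices

def unpack_weights(unique_chunks, indices):
    """Rebuild the original weight list by looking up each index."""
    weights = []
    for idx in indices:
        weights.extend(unique_chunks[idx])
    return weights

def packet_bits(indices, packet_size):
    """Index width per packet, sized to the largest ID in that packet (assumed rule)."""
    bits = []
    for i in range(0, len(indices), packet_size):
        packet = indices[i:i + packet_size]
        bits.append(max(1, max(packet).bit_length()))
    return bits

# Hypothetical toy weights: the chunk (1, 2) repeats, so it is stored once.
weights = [1, 2, 1, 2, 3, 4, 1, 2, 1, 2, 5, 6]
unique_chunks, indices = pack_weights(weights, chunk_size=2)
restored = unpack_weights(unique_chunks, indices)
bits = packet_bits(indices, packet_size=3)
```

In this toy case the round trip is exact (`restored == weights`), and packets that contain only frequent chunks need fewer index bits than packets that reference rare ones, which is the intuition behind pairing frequency-aware re-indexing with packet-specific precision.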
Abstract
The computational and memory challenges of large language models (LLMs) have sparked several optimization approaches towards their efficient implementation. While prior work on LLM-targeted quantization and sparse acceleration has significantly mitigated the memory and computation bottlenecks, it assumes high-power platforms such as GPUs and server-class FPGAs with large off-chip memory bandwidths, and it executes all the layers in the decoder as general matrix multiplications (GEMMs). In such a GEMM-based execution, data is fetched from off-chip memory, computed on, and stored back. However, at reduced off-chip memory bandwidths, as is the case with low-power edge devices, this implementation strategy significantly increases the attention computation latency owing to the repeated storage and fetch of large intermediate tokens to and from the off-chip memory. Moreover, fetching the weight matrices from a bandwidth-constrained memory further aggravates the memory bottleneck. To this end, we introduce MEADOW, a framework that significantly reduces the off-chip memory access for LLMs with a novel token-parallel head-sequential (TPHS) dataflow. Additionally, MEADOW applies weight packing, which performs a lossless decomposition of large weight matrices into their unique elements, thereby reducing the enormous weight fetch latency. MEADOW demonstrates 1.5x and 2.5x lower decode and prefill latency, respectively, compared to a GEMM-based LLM implementation on the low-power Xilinx ZCU102 FPGA platform, which consumes less than 10W. Additionally, MEADOW achieves an end-to-end latency improvement of over 40% compared to prior LLM optimization works.
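To make the argument about repeated storage and fetch of intermediates concrete, the following is a rough element-count model, not a measurement and not the paper's own accounting. It assumes that a layer-by-layer GEMM execution round-trips every intermediate (Q, the QK^T scores, and the softmax output) through off-chip memory, while a fused dataflow reads only the inputs and writes only the final output. The function names and the parameters `T` (tokens), `D` (embedding size), and `H` (heads) are hypothetical.

```python
def gemm_traffic(T, D, H):
    """Off-chip elements moved when each layer reads inputs and writes outputs."""
    token_matrix = T * D      # X, Q, K, V and the attention output each have T*D elements
    q_weights = D * D         # Q projection weights
    attention = H * T * T     # per-head T x T score / softmax matrices
    layers = [
        (token_matrix + q_weights, token_matrix),  # Q:    read X, Wq; write Q
        (2 * token_matrix, attention),             # QK^T: read Q, K;  write scores
        (attention, attention),                    # SM:   read scores; write probs
        (attention + token_matrix, token_matrix),  # SMxV: read probs, V; write output
    ]
    return sum(reads + writes for reads, writes in layers)

def fused_traffic(T, D, H):
    """Off-chip elements moved when intermediates never leave the chip (assumed)."""
    token_matrix = T * D
    q_weights = D * D
    reads = token_matrix + q_weights + 2 * token_matrix  # X, Wq, K, V
    writes = token_matrix                                # final attention output
    return reads + writes

# Hypothetical toy sizes, chosen only to keep the arithmetic easy to check.
T, D, H = 4, 8, 2
gemm = gemm_traffic(T, D, H)
fused = fused_traffic(T, D, H)
saved = gemm - fused
```

Under these assumptions the saving equals `2*T*D + 4*H*T*T` elements, so the term that grows quadratically with the number of tokens is exactly the intermediate traffic that a fused dataflow avoids.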
