Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA
Jindong Li, Tenglong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng
TL;DR
The paper tackles the challenge of running a 7B-parameter LLM on an embedded FPGA with 4 GB of memory and 19.2 GB/s of bandwidth, addressing both capacity and bandwidth bottlenecks during decoding. It introduces a hardware-software co-design on the KV260 that combines W4A16 weight quantization and KV8 cache quantization with a burst-oriented dataflow and a data arrangement format to maximize memory throughput, all implemented in a bare-metal environment. The result is a LLaMA2-7B inference pipeline that decodes at about 5 tokens/s while using about 93.3% of memory capacity and achieving about 85% of the theoretical bandwidth limit, marking the first deployment of a 7B LLM on an embedded FPGA. The work provides practical insights and architectural guidelines for future edge LLM accelerators, highlighting memory-centric design as a key enabler for efficient on-device decoding and pointing to potential pathways for larger models with advanced memory technologies.
Abstract
The extremely high computational and storage demands of large language models (LLMs) have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device has only 4GB of memory capacity and less than 20GB/s of bandwidth, while a 7B-parameter LLM quantized to 4-bit precision already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we explore these limits by proposing a hardware accelerator for LLM inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 tokens/s, utilizing 93.3% of the memory capacity and reaching 85% of the decoding speed permitted by the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator fusion pipeline and propose a data arrangement format that maximizes data transaction efficiency. This research marks the first attempt to deploy a 7B-level LLM on a standalone embedded field-programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and guidelines for future architecture design.
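The figures in the abstract can be cross-checked with a back-of-envelope calculation. The sketch below is illustrative only and is not the authors' model: the function names are hypothetical, it assumes exactly 7e9 parameters, decimal gigabytes, and that every decoded token streams all weights from DDR4 once, and it ignores KV-cache traffic and quantization metadata such as scales.

```python
def peak_bandwidth_gbs(bus_width_bits: int, rate_mbps_per_pin: int) -> float:
    """Peak DDR bandwidth in GB/s from bus width and per-pin data rate."""
    # bits per second across the bus -> bytes per second -> gigabytes per second
    return bus_width_bits * rate_mbps_per_pin / 8 / 1000


def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in GB (assumption: no scales/zeros)."""
    return num_params * bits_per_weight / 8 / 1e9


def decode_upper_bound(bandwidth_gbs: float, weights_gb: float) -> float:
    """Tokens/s ceiling if each token requires reading all weights once."""
    return bandwidth_gbs / weights_gb


bandwidth = peak_bandwidth_gbs(64, 2400)   # 64-bit DDR4 at 2400 Mbps per pin
weights = weight_footprint_gb(7e9, 4)      # 7B parameters at 4 bits each
ceiling = decode_upper_bound(bandwidth, weights)

print(f"peak bandwidth: {bandwidth:.1f} GB/s")
print(f"weight footprint: {weights:.1f} GB")
print(f"decode ceiling: {ceiling:.2f} tokens/s")
print(f"85% of ceiling: {0.85 * ceiling:.2f} tokens/s")
```

Under these assumptions the ceiling comes out at roughly 5.5 tokens/s, so 85% of it is consistent with the reported decoding speed of around 5 tokens/s. The paper's own theoretical limit may account for additional traffic that this estimate leaves out.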
