Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA

Jindong Li, Tenglong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng

TL;DR

The paper tackles the challenge of running a 7B-parameter LLM on an embedded FPGA with 4 GB of memory and 19.2 GB/s of bandwidth, addressing both the capacity and the bandwidth bottlenecks of decoding. It introduces a hardware-software co-design on the KV260 that combines W4A16 weight quantization and KV8 cache quantization with a burst-oriented dataflow and a bus-width-aligned data arrangement format to maximize memory throughput, all implemented in a bare-metal environment. The result is a LLaMA2-7B inference pipeline that decodes at about 5 tokens/s while occupying ~93.3% of memory capacity and reaching ~85% of the decoding speed allowed by the theoretical bandwidth limit, which the authors describe as the first attempt to deploy a 7B-level LLM on a standalone embedded FPGA. The work offers practical insights and architectural guidelines for future edge LLM accelerators, highlighting memory-centric design as a key enabler for efficient on-device decoding and pointing to possible pathways for larger models with more advanced memory technologies.

Abstract

The extremely high computational and storage demands of large language models have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device usually has only 4GB of memory capacity and a bandwidth of less than 20GB/s, while a large language model with 7B parameters quantized to 4-bit precision already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we aim to explore these limits by proposing a hardware accelerator for large language model (LLM) inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 tokens/s, utilizing 93.3% of the memory capacity and reaching 85% of the decoding speed set by the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator fusion pipeline and propose a data arrangement format that maximizes data transaction efficiency. This research marks the first attempt to deploy a 7B-level LLM on a standalone embedded field programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and offers guidelines for future architecture design.
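
The figures quoted in the abstract can be cross-checked with a short back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes a nominal 7e9 parameters, decimal gigabytes, and that every weight is read from DDR exactly once per decoded token, and it ignores quantization scales, zero points, and KV-cache traffic. The function names are illustrative only.

```python
# Rough roofline-style estimate of decoding speed on the KV260.
# All simplifications here are assumptions, not the paper's exact accounting.

BUS_WIDTH_BITS = 64        # DDR4 data bus width (from the abstract)
DATA_RATE_MTPS = 2400      # 2400 Mbps per pin, i.e. 2400 mega-transfers/s
PARAMS = 7e9               # nominal parameter count (assumption)
WEIGHT_BITS = 4            # 4-bit weight quantization
MEM_CAPACITY_GB = 4.0      # board memory, decimal GB (assumption)


def peak_bandwidth_gb_s(bus_bits: int, rate_mtps: float) -> float:
    """Theoretical peak DDR bandwidth in GB/s."""
    return bus_bits / 8 * rate_mtps * 1e6 / 1e9


def weight_footprint_gb(params: float, bits: int) -> float:
    """Storage needed for the quantized weights alone, in GB."""
    return params * bits / 8 / 1e9


def decode_upper_bound(bandwidth_gb_s: float, footprint_gb: float) -> float:
    """Tokens/s if each token requires streaming all weights once."""
    return bandwidth_gb_s / footprint_gb


bandwidth = peak_bandwidth_gb_s(BUS_WIDTH_BITS, DATA_RATE_MTPS)
footprint = weight_footprint_gb(PARAMS, WEIGHT_BITS)
bound = decode_upper_bound(bandwidth, footprint)
achieved = 0.85 * bound                  # 85% of the bound, per the abstract
occupied = 0.933 * MEM_CAPACITY_GB       # 93.3% capacity utilization

print(f"peak bandwidth : {bandwidth:.2f} GB/s")
print(f"weight storage : {footprint:.2f} GB")
print(f"decode bound   : {bound:.2f} tokens/s")
print(f"85% of bound   : {achieved:.2f} tokens/s")
print(f"memory in use  : {occupied:.2f} GB")
```

Under these assumptions the bandwidth-limited bound is roughly 5.5 tokens/s, and 85% of it is roughly 4.7 tokens/s, which is consistent with the reported "around 5 tokens/s". The gap between the 3.5 GB of weights and the roughly 3.7 GB in use is what remains for the KV cache and quantization metadata in this simplified view.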

Paper Structure

This paper contains 23 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: LLaMA2-7B inference on the embedded KV260 platform, with 93.3% of the memory capacity occupied and 85% memory bandwidth utilization.
  • Figure 2: LLM Inference Process of a LLaMA-like model. A) The prefill phase. B) The decode phase. C) Inference process breakdown of a single layer.
  • Figure 3: The pipelined dataflow in the attention layer, with all miscellaneous processing hidden behind the dense computation to avoid cycle penalties.
  • Figure 4: Bus-width Aligned Data Arrangement Format. A) Compact model weight arrangement format interleaving zero points, scales, and weights. B) KV cache scale-zero packing process to minimize scalar data transfers.
  • Figure 5: Hardware Architecture of the Accelerator. A) Memory Control Unit (shaded in light purple) ensures full access to DDR bandwidth. B) Vector Processing Unit (shaded in light blue) performs dense computations. C) Scalar Processing Unit (shaded in light orange) handles miscellaneous processes.