LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs

Han Xu, Yutong Li, Shihao Ji

TL;DR

This work tackles efficient LLM inference on embedded FPGAs with LlamaF, an accelerator for the Llama2 architecture. LlamaF combines a group-wise W8A8 quantization scheme with a fully pipelined GQMV accelerator, and uses asynchronous weight loading to overlap data transfers with computation. On TinyLlama 1.1B, LlamaF achieves a 14.3–15.8x speedup and a 6.1x improvement in power efficiency over running exclusively on the ZCU102 processing system (PS), with a 0.57% perplexity increase on WikiText-2. These results demonstrate the viability of edge-LLM inference and point to future work on accelerating multi-head attention via softmax approximation.
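To make the quantization idea concrete, the following is a minimal software sketch of group-wise W8A8 quantization (8-bit weights and 8-bit activations, with one scale per group of values) feeding a quantized matrix-vector product. It is not the paper's implementation: it assumes that GQMV denotes a group-wise quantized matrix-vector multiplication, that scaling is symmetric max-abs, and that the matrix is stored row-major; the function names `quantize_groups` and `gqmv` and the `group_size` parameter are hypothetical.

```python
def quantize_groups(values, group_size):
    """Symmetric int8 quantization with one scale per group (assumed scheme)."""
    assert len(values) % group_size == 0
    quantized, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # Map the largest magnitude in the group to 127; avoid a zero scale.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-127, min(127, round(v / scale))) for v in group)
    return quantized, scales


def gqmv(wq, w_scales, xq, x_scales, rows, cols, group_size):
    """Quantized matrix-vector product over a row-major (rows x cols) matrix."""
    assert cols % group_size == 0
    groups_per_row = cols // group_size
    out = []
    for r in range(rows):
        acc = 0.0
        for g in range(groups_per_row):
            base = r * cols + g * group_size
            # Integer dot product inside one group.
            int_sum = sum(
                wq[base + i] * xq[g * group_size + i] for i in range(group_size)
            )
            # Rescale once per group with the weight and activation scales.
            acc += int_sum * w_scales[r * groups_per_row + g] * x_scales[g]
        out.append(acc)
    return out


# Hypothetical 2x4 weight matrix and activation vector, group size 2.
W = [0.5, -1.0, 0.25, 0.75, 1.5, 0.1, -0.3, 0.9]
x = [1.0, -2.0, 0.5, 0.25]
wq, ws = quantize_groups(W, 2)
xq, xs = quantize_groups(x, 2)
y = gqmv(wq, ws, xq, xs, rows=2, cols=4, group_size=2)
```

In this sketch the inner loop works only on 8-bit integers, and floating-point work is limited to one rescale per group, which is the general property that makes such a scheme attractive for a hardware pipeline.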

Abstract

Large language models (LLMs) have demonstrated remarkable abilities in natural language processing. However, their deployment on resource-constrained embedded devices remains difficult due to their memory and computational demands. In this paper, we present an FPGA-based accelerator designed to improve LLM inference performance on embedded FPGAs. We employ post-training quantization to reduce model size and optimize for off-chip memory bandwidth. Our design features asynchronous computation and a fully pipelined accelerator for matrix-vector multiplication. Experiments with the TinyLlama 1.1B model on a Xilinx ZCU102 platform show a 14.3-15.8x speedup and a 6.1x power efficiency improvement over running exclusively on the ZCU102 processing system (PS).
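The asynchronous computation mentioned above can be illustrated with a small host-side analogy: while one layer is being computed, the next layer's weights are already being fetched, so transfer time is hidden behind compute time. This is only a sketch under stated assumptions, not the paper's design; `run_layers_async`, `load_weights`, and `compute` are hypothetical names, and a Python thread stands in for the data-transfer engine.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers_async(num_layers, load_weights, compute, x):
    """Run layers in order, prefetching layer i+1 weights while layer i computes."""
    if num_layers == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_weights, 0)
        for layer in range(num_layers):
            weights = pending.result()  # wait for the current layer's weights
            if layer + 1 < num_layers:
                # Start the next transfer before computing the current layer.
                pending = loader.submit(load_weights, layer + 1)
            x = compute(weights, x)  # runs while the next transfer is in flight
    return x


# Hypothetical stand-ins: "weights" are integers, "compute" multiplies.
result = run_layers_async(4, lambda i: i + 1, lambda w, x: x * w, 1)
```

A synchronous version would call `load_weights` and `compute` back to back for each layer, so the total time would be the sum of both; with prefetching, the per-layer time approaches the larger of the two.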

Paper Structure

This paper contains 21 sections, 3 equations, 3 figures, 6 tables, and 3 algorithms.

Figures (3)

  • Figure 1: Llama2 Architecture: Forward Pass.
  • Figure 2: Comparison of synchronous vs. asynchronous FPGA computation.
  • Figure 3: Overview of LlamaF hardware design.