LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs
Han Xu, Yutong Li, Shihao Ji
TL;DR
This work tackles efficient LLM inference on embedded FPGAs with LlamaF, an accelerator for the Llama2 architecture. LlamaF combines a group-wise W8A8 quantization scheme with a fully pipelined GQMV accelerator, and uses asynchronous weight loading to overlap data transfers with computation. On TinyLlama 1.1B, LlamaF achieves a 14.3–15.8x speedup and a 6.1x improvement in power efficiency over running on the ZCU102 processing system (PS) alone, at the cost of a 0.57% perplexity increase on WikiText-2. These results demonstrate the viability of edge-LLM inference and point to future work on accelerating multi-head attention via softmax approximation.
Abstract
Large language models (LLMs) have demonstrated remarkable abilities in natural language processing. However, their deployment on resource-constrained embedded devices remains difficult due to their high memory and computational demands. In this paper, we present an FPGA-based accelerator designed to improve LLM inference performance on embedded FPGAs. We employ post-training quantization to reduce model size and optimize for off-chip memory bandwidth. Our design features asynchronous computation and a fully pipelined accelerator for matrix-vector multiplication. Experiments with the TinyLlama 1.1B model on a Xilinx ZCU102 platform show a 14.3–15.8x speedup and a 6.1x power efficiency improvement over running exclusively on the ZCU102 processing system (PS).
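To make the quantized matrix-vector multiplication concrete, the following is a minimal software sketch of group-wise 8-bit quantization followed by a matrix-vector product. It is not the authors' hardware design. It assumes symmetric quantization with one floating-point scale per group, and the names `quantize_groupwise`, `gqmv`, and `GROUP_SIZE`, as well as the tiny group size and the example data, are hypothetical choices for illustration.

```python
from typing import List, Tuple

GROUP_SIZE = 4  # hypothetical value chosen to keep the example small


def quantize_groupwise(
    values: List[float], group_size: int = GROUP_SIZE
) -> Tuple[List[int], List[float]]:
    """Symmetric int8 quantization with one scale per group (assumed scheme)."""
    if len(values) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # Map the largest magnitude in the group to 127; avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-127, min(127, round(v / scale))) for v in group)
    return quantized, scales


def gqmv(
    wq_rows: List[List[int]],
    w_scale_rows: List[List[float]],
    xq: List[int],
    x_scales: List[float],
    group_size: int = GROUP_SIZE,
) -> List[float]:
    """Matrix-vector product on quantized data.

    Products are accumulated as integers within each group, then rescaled
    by the weight and activation scales of that group.
    """
    out: List[float] = []
    for wq, w_scales in zip(wq_rows, w_scale_rows):
        acc = 0.0
        for g in range(len(x_scales)):
            lo = g * group_size
            int_sum = sum(wq[i] * xq[i] for i in range(lo, lo + group_size))
            acc += int_sum * w_scales[g] * x_scales[g]
        out.append(acc)
    return out


# Example data (made up): a 2x8 weight matrix and an 8-element activation vector.
W = [
    [0.5, -1.0, 0.25, 0.75, 2.0, -0.5, 1.5, 0.0],
    [1.0, 1.0, -1.0, -1.0, 0.1, 0.2, 0.3, 0.4],
]
x = [1.0, 2.0, -1.0, 0.5, 0.25, -0.75, 1.0, 2.0]

quantized_rows = [quantize_groupwise(row) for row in W]
wq_rows = [q for q, _ in quantized_rows]
w_scale_rows = [s for _, s in quantized_rows]
xq, x_scales = quantize_groupwise(x)

y_quantized = gqmv(wq_rows, w_scale_rows, xq, x_scales)
y_reference = [sum(w * v for w, v in zip(row, x)) for row in W]
```

Comparing `y_quantized` with `y_reference` shows the approximation error introduced by quantization. Keeping the inner sum in integers and applying the scales once per group is what makes this style of computation attractive for fixed-point hardware.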
