FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang

TL;DR

FlightLLM tackles the inefficiency of LLM inference by delivering a complete FPGA-based mapping flow. Its main innovations are a configurable sparse DSP chain, an always-on-chip decode path with mixed-precision support, and a length-adaptive compilation strategy. On a Xilinx Alveo U280, the approach reaches 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency than an NVIDIA V100S GPU at a batch size of one, and enables larger LLMs to run on FPGA hardware. This work demonstrates the practicality of FPGA-based LLM inference with real-world models such as LLaMA2-7B and OPT-6.7B on standard FPGA platforms.

Abstract

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to bridge the gap between LLMs' computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs. In FlightLLM, we show that the computation and memory overheads of LLMs can be addressed by exploiting FPGA-specific resources (e.g., DSP48 and the heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length-adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency than commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant at a batch size of one. FlightLLM outperforms the NVIDIA A100 GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.
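
To make the idea of a sparsity pattern concrete, the following is a minimal software sketch and not the FlightLLM hardware design. It assumes an N:M structured pattern (keep the N largest-magnitude weights in every group of M); the abstract does not say which patterns the DSP chain supports, and the helper names `prune_n_of_m`, `compress`, and `sparse_dot` are hypothetical. The point it illustrates is that once zeros are dropped and column indices are stored, a dot product needs multiply-accumulates only for the surviving weights.

```python
from typing import List, Tuple


def prune_n_of_m(row: List[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude weights in every group of m; zero the rest.

    Assumption: N:M structured sparsity, chosen here only for illustration.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions of the n largest-magnitude entries inside this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def compress(row: List[float]) -> List[Tuple[int, float]]:
    """Store only the non-zero weights, each paired with its column index."""
    return [(i, w) for i, w in enumerate(row) if w != 0.0]


def sparse_dot(packed: List[Tuple[int, float]], activations: List[float]) -> float:
    """Multiply-accumulate over stored weights only; pruned zeros cost nothing."""
    acc = 0.0
    for idx, w in packed:
        acc += w * activations[idx]
    return acc


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

pruned = prune_n_of_m(weights, n=2, m=4)   # 2:4 pattern, half the weights removed
packed = compress(pruned)
dense_result = sum(w * a for w, a in zip(pruned, activations))
sparse_result = sparse_dot(packed, activations)
```

With the 2:4 pattern used here, `packed` holds four of the eight weights, so `sparse_dot` performs half the multiply-accumulates of the dense loop while agreeing with it up to floating-point rounding.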

Paper Structure

This paper contains 33 sections, 3 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: FlightLLM on Alveo U280 FPGA outperforms NVIDIA V100S GPU (using vLLM [Kwon et al., 2023] and SmoothQuant [Xiao et al., 2023]) with better performance and cost efficiency.
  • Figure 2: Three challenges of LLM inference on FPGAs, and the corresponding solutions in FlightLLM.
  • Figure 3: The (a) prefill and (b) decode stage of LLMs. Colored squares are weights or cached data. Gray denotes activations.
  • Figure 4: The overall architecture of FlightLLM, including task scheduler, memory controller and computing cores.
  • Figure 5: The unified Matrix Processing Engine (MPE) can perform multiple types of matrix multiplications. (a) MPE includes multiple Matrix Processing Units (MPUs), which are composed of multiple Vector Processing Units (VPUs). By configuring the MPU, the MPE can support both (b) matrix-matrix multiplication (MM) mode and (c) matrix-vector multiplication (MV) mode. (d) We utilize DSP resources on the FPGA to implement the VPU.
  • ...and 10 more figures
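
The captions of Figures 3 and 5 can be tied together with a small sketch. It assumes that the prefill stage maps to matrix-matrix (MM) mode, because all prompt tokens are processed together, and that the decode stage maps to matrix-vector (MV) mode, because one token is produced per step. The function names and the single-weight "cache" are hypothetical simplifications, not the paper's MPE or its KV cache layout.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix-matrix product: (rows x k) times (k x cols)."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(row[i] * b[i][j] for i in range(k)) for j in range(cols)]
        for row in a
    ]


def prefill(prompt_acts: Matrix, weight: Matrix) -> Matrix:
    """Prefill: every prompt token is projected in one call (MM mode)."""
    return matmul(prompt_acts, weight)


def decode_step(token_act: List[float], weight: Matrix) -> List[float]:
    """Decode: a single new token, so the product reduces to MV mode."""
    return matmul([token_act], weight)[0]


weight = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # 3 features -> 2 outputs
prompt = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]     # 2 prompt tokens x 3 features

cache = prefill(prompt, weight)                 # one cached row per prompt token
new_row = decode_step([2.0, 0.0, 1.0], weight)  # one row for the new token
cache.append(new_row)                           # cache grows by one row per step
```

Because `decode_step` is `matmul` applied to a one-row matrix, the same routine serves both stages. In decode, each weight is read for only one multiply, which is why that stage tends to be limited by memory bandwidth rather than by compute.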