Designing Efficient LLM Accelerators for Edge Devices

Jude Haris, Rappy Saha, Wenhao Hu, José Cano

TL;DR

This paper tackles the challenge of running large language models on resource-constrained edge devices by introducing SECDA-LLM, a design platform that employs the SECDA SystemC-based co-design methodology to rapidly prototype, simulate, and deploy FPGA-accelerated LLM inference within the llama.cpp framework. It details integration with llama.cpp, end-to-end SystemC simulation, and a hardware-evaluation flow, culminating in a case study that implements a quantized MatMul accelerator for GGML's MatMul_Q3_K_Q8_K kernel. The MatMul accelerator, optimized for block-floating-point quantization, achieves about 11× speedup over a dual-core NEON CPU on TinyLlama when deployed on a PYNQ-Z1 board. Overall, SECDA-LLM enables hardware-software co-design for edge LLM acceleration and demonstrates a practical flow from model quantization to FPGA deployment with measurable performance gains.

Abstract

The increase in open-source availability of Large Language Models (LLMs) has enabled users to deploy them on more and more resource-constrained edge devices to reduce reliance on network connections and provide more privacy. However, the high computation and memory demands of LLMs make their execution on resource-constrained edge devices challenging and inefficient. To address this issue, designing new and efficient edge accelerators for LLM inference is crucial. FPGA-based accelerators are ideal for LLM acceleration due to their reconfigurability, as they enable model-specific optimizations and higher performance per watt. However, creating and integrating FPGA-based accelerators for LLMs (particularly on edge devices) has proven challenging, mainly due to the limited hardware design flows for LLMs in existing FPGA platforms. To tackle this issue, in this paper we first propose a new design platform, named SECDA-LLM, that utilizes the SECDA methodology to streamline the process of designing, integrating, and deploying efficient FPGA-based LLM accelerators for the llama.cpp inference framework. We then demonstrate, through a case study, the potential benefits of SECDA-LLM by creating a new MatMul accelerator that supports block floating point quantized operations for LLMs. Our initial accelerator design, deployed on the PYNQ-Z1 board, reduces latency (1.7 seconds per token or ~2 seconds per word) by 11x over the dual-core Arm NEON-based CPU execution for the TinyLlama model.
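
To make "block floating point quantized operations" concrete, the following is a minimal Python sketch of a block-quantized dot product, the inner operation of a quantized MatMul. It is an illustration only, not the accelerator design or the GGML kernel: the block size `BLOCK`, the helper names `quantize_blocks` and `bfp_dot`, and the symmetric scaling scheme are all assumptions made for this example. The 3-bit and 8-bit operand widths loosely echo the Q3_K and Q8_K names, but the real GGML formats use their own block layout and bit packing, which this sketch does not reproduce.

```python
import numpy as np

BLOCK = 32  # hypothetical block size, chosen for illustration only


def quantize_blocks(x, bits):
    """Quantize a 1-D float array into blocks that each share one scale."""
    qmax = 2 ** (bits - 1) - 1            # 3 for 3-bit, 127 for 8-bit
    blocks = x.reshape(-1, BLOCK)
    # One scale per block: the largest magnitude in the block maps to qmax.
    scales = np.abs(blocks).max(axis=1) / qmax
    scales[scales == 0] = 1.0             # guard against all-zero blocks
    q = np.clip(np.rint(blocks / scales[:, None]), -qmax, qmax).astype(np.int32)
    return q, scales.astype(np.float32)


def bfp_dot(qa, sa, qb, sb):
    """Dot product of two block-quantized vectors."""
    # Integer multiply-accumulate inside each block.
    int_sums = (qa * qb).sum(axis=1)
    # One floating-point rescale per block pair, then a final reduction.
    return float((int_sums * sa * sb).sum())


rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)  # stand-in for a weight row
a = rng.standard_normal(256).astype(np.float32)  # stand-in for an activation column

qw, sw = quantize_blocks(w, bits=3)  # low-precision weights
qa, sa = quantize_blocks(a, bits=8)  # 8-bit activations
approx = bfp_dot(qw, sw, qa, sa)
exact = float(w @ a)
print(f"exact={exact:.4f} block-quantized={approx:.4f}")
```

The point of the structure is that most of the work is small-integer multiply-accumulate, with only one floating-point rescale per block pair, which is the kind of workload that maps well onto FPGA logic.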

Paper Structure (18 sections, 3 figures)

This paper contains 18 sections, 3 figures.

Figures (3)

  • Figure 1: Overview of the SECDA methodology (Haris et al., 2021). Components in the dashed lines correspond to simulation, and in the dotted lines to execution on real hardware.
  • Figure 2: Overview of SECDA-LLM. Key SECDA components are highlighted in orange, and the LLM components are highlighted in beige.
  • Figure 3: Overview of our block floating point quantized accelerator design for GGML's MatMul_Q3_K_Q8_K kernel.