On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration

Maoyang Xiang, Ramesh Fernando, Bo Wang

TL;DR

The paper tackles the challenge of deploying LLMs on edge devices by proposing an end-to-end framework for running Qwen2.5-0.5B on the Xilinx Kria KV260 that combines Activation-aware Weight Quantization (AWQ) with FPGA acceleration. It introduces a software-hardware co-design with AWQ-based weight packing and a pipelined, dequantizing MAC accelerator in the FPGA, complemented by a hybrid CPU-FPGA execution strategy. Key results show a 55.1% reduction in model size and a throughput increase from 2.8 to 5.1 tokens per second (about 1.8×), with a modest accuracy drop. The approach enables practical, privacy-preserving edge inference for modern LLMs in resource-constrained environments, highlighting the viability of on-device LLMs through tailored quantization and hardware specialization.

Abstract

Transformer-based Large Language Models (LLMs) have significantly advanced AI capabilities but pose considerable challenges for deployment on edge devices due to high computational demands, memory bandwidth constraints, and energy consumption. This paper addresses these challenges by presenting an efficient framework for deploying the Qwen2.5-0.5B model on the Xilinx Kria KV260 edge platform, a heterogeneous system integrating an ARM Cortex-A53 CPU with reconfigurable FPGA logic. Leveraging Activation-aware Weight Quantization (AWQ) with FPGA-accelerated execution pipelines, the proposed approach enhances both model compression rate and system throughput. Additionally, we propose a hybrid execution strategy that intelligently offloads compute-intensive operations to the FPGA while utilizing the CPU for lighter tasks, effectively balancing the computational workload and maximizing overall performance. Our framework achieves a model compression rate of 55.08% compared to the original model and produces output at a rate of 5.1 tokens per second, outperforming the baseline performance of 2.8 tokens per second.

Paper Structure

This paper contains 10 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Simplified hardware architecture of Xilinx Kria K26 system-on-module (SOM) adapted from kria-k26-datasheet.
  • Figure 2: AWQ shows that (a) keeping only 1% of salient weights in FP16 achieves accuracy similar to the original model but is not hardware-friendly; (b) following activation awareness and performing per-channel scaling protects the salient weights and reduces quantization error. The figure is adapted from lin2023awq.
  • Figure 3: Customized weight compression in the memory via AWQ_MACRO, a block with scales, zeros, and quantized weights in INT4.
  • Figure 4: (a) Order of loading AWQ_MACRO blocks into the MAC units using 4 AXI channels; (b) unpacking unit, which unpacks each AWQ_MACRO; (c) MACRO MAC unit with PE arrays, where a sliding window (in grey) accumulates p_sum along the columns using the adder tree; (d) operations performed inside a PE.
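
The captions for Figures 3 and 4 describe a block that stores scales, zeros, and INT4 weights together, an unpacking unit, and a MAC unit that works on the unpacked values. A minimal software sketch of that idea follows. It is an illustration under stated assumptions, not the paper's implementation: the function names (quantize_group, pack_int4, unpack_int4, dequant_mac), the group size of 8, and the nibble order inside a 32-bit word are all hypothetical, and AWQ's activation-aware per-channel scaling step is omitted so that only the storage layout and the dequantize-then-multiply-accumulate step remain.

```python
# Sketch only: names, group size (8) and nibble order are assumptions,
# not taken from the paper. AWQ's activation-aware scaling is omitted.

def quantize_group(weights):
    """Asymmetric 4-bit quantization of one weight group.

    Returns (scale, zero, codes) where each code is an integer in [0, 15]
    and a weight is approximated as (code - zero) * scale.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # 15 steps for 4 bits; guard constant groups
    zero = max(0, min(15, round(-lo / scale)))
    codes = [max(0, min(15, round(w / scale) + zero)) for w in weights]
    return scale, zero, codes


def pack_int4(codes):
    """Pack 4-bit codes into 32-bit words, eight codes per word, low nibble first."""
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            word |= (code & 0xF) << (4 * j)
        words.append(word)
    return words


def unpack_int4(words, count):
    """Inverse of pack_int4: recover the first `count` 4-bit codes."""
    return [(words[i // 8] >> (4 * (i % 8))) & 0xF for i in range(count)]


def dequant_mac(activations, scale, zero, words):
    """Unpack, dequantize, and multiply-accumulate against the activations."""
    codes = unpack_int4(words, len(activations))
    return sum(x * (q - zero) * scale for x, q in zip(activations, codes))


# Usage: one group of 8 weights stored as a scale, a zero point, and one word.
weights = [-0.7, -0.3, 0.0, 0.2, 0.4, 0.8, -0.1, 0.5]
activations = [1.0, 0.5, -2.0, 0.25, 1.5, -1.0, 3.0, 0.0]

scale, zero, codes = quantize_group(weights)
words = pack_int4(codes)
approx = dequant_mac(activations, scale, zero, words)
exact = sum(x * w for x, w in zip(activations, weights))
print(f"exact={exact:.4f} approx={approx:.4f} words={len(words)}")
```

Keeping the scale and zero point next to the packed codes means a single contiguous read supplies everything needed to dequantize a group, which is the property the AWQ_MACRO layout in Figure 3 appears to target. In hardware, the dequantization and accumulation would be pipelined across PEs rather than executed as a Python loop.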