LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong

TL;DR

LUT-LLM targets efficient on-device LLM inference by shifting computation from arithmetic to memory-based table lookups on an FPGA. It combines activation-weight co-quantization with 2D lookup tables and a spatial-temporal hybrid architecture to support 1B+ models, demonstrated with a customized Qwen 3 1.7B model on the AMD V80. The design achieves 1.66x lower latency than an AMD MI210 and 1.72x higher energy efficiency than an NVIDIA A100, and it scales to 32B models with a 2.16x efficiency gain over the A100, pointing to a viable path for edge LLM deployment. The work broadens the design space for LLM acceleration by showing that memory-based compute on FPGAs can outperform arithmetic-based compute on long sequences, and it underlines the role of memory bandwidth and on-chip lookups in future AI hardware.

Abstract

The rapid progress of large language models (LLMs) has advanced numerous applications, yet efficient single-batch inference remains vital for on-device intelligence. While FPGAs offer fine-grained data control and high energy efficiency, recent GPU optimizations have narrowed their advantage, especially under arithmetic-based computation. To overcome this, we leverage FPGAs' abundant on-chip memory to shift LLM inference from arithmetic- to memory-based computation through table lookups. We present LUT-LLM, the first FPGA accelerator enabling 1B+ LLM inference via vector-quantized memory operations. Our analysis identifies activation-weight co-quantization as the most effective scheme, supported by (1) bandwidth-aware parallel centroid search, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design minimizing data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B model, LUT-LLM achieves 1.66x lower latency than AMD MI210 and 1.72x higher energy efficiency than NVIDIA A100, scaling to 32B models with 2.16x efficiency gain over A100.
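
To make the idea of activation-weight co-quantization with 2D lookup tables concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the grouping of the input dimension into sub-vectors, the codebook sizes, the nearest-centroid search by squared Euclidean distance, and all function names are hypothetical. Each 2D table stores the dot product of every activation centroid with every weight centroid for one group, so a linear projection reduces to index lookups and accumulation.

```python
import numpy as np

def nearest_centroid(vectors, codebook):
    """Index of the closest centroid (squared Euclidean) for each row."""
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

def build_2d_luts(act_codebooks, wgt_codebooks):
    """(G, Ka, v) and (G, Kw, v) codebooks -> (G, Ka, Kw) dot-product tables."""
    return np.einsum("gav,gwv->gaw", act_codebooks, wgt_codebooks)

def quantize_weights(W, wgt_codebooks):
    """Map each length-v weight sub-vector of W (D, N) to a centroid index."""
    G, _, v = wgt_codebooks.shape
    idx = np.empty((G, W.shape[1]), dtype=np.int64)
    for g in range(G):
        idx[g] = nearest_centroid(W[g * v:(g + 1) * v, :].T, wgt_codebooks[g])
    return idx

def quantize_activation(x, act_codebooks):
    """Map each length-v activation sub-vector of x (D,) to a centroid index."""
    G, _, v = act_codebooks.shape
    return np.array([
        nearest_centroid(x[g * v:(g + 1) * v][None, :], act_codebooks[g])[0]
        for g in range(G)
    ])

def lut_linear(act_idx, wgt_idx, luts):
    """y[n] = sum over groups g of luts[g, act_idx[g], wgt_idx[g, n]]."""
    g = np.arange(wgt_idx.shape[0])[:, None]
    return luts[g, act_idx[:, None], wgt_idx].sum(axis=0)

# Toy sizes (hypothetical): 4 groups of length-2 vectors, 8 centroids each.
rng = np.random.default_rng(0)
G, v, Ka, Kw, N = 4, 2, 8, 8, 5
act_cb = rng.standard_normal((G, Ka, v))
wgt_cb = rng.standard_normal((G, Kw, v))
W = rng.standard_normal((G * v, N))
x = rng.standard_normal(G * v)

luts = build_2d_luts(act_cb, wgt_cb)   # precomputed offline
w_idx = quantize_weights(W, wgt_cb)    # precomputed offline
a_idx = quantize_activation(x, act_cb) # centroid search at run time
y = lut_linear(a_idx, w_idx, luts)     # lookups and adds only
```

In this sketch the result `y` equals the product of the dequantized activation and the dequantized weight matrix, so the only approximation error comes from the two quantization steps, not from the table lookup itself.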

Paper Structure

This paper contains 28 sections, 13 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Motivations and challenges of memory-based computation for LLM inference on FPGA, with the corresponding solutions as the technical contributions in LUT-LLM.
  • Figure 2: Linear projection with weight vector quantization.
  • Figure 3: Linear projection with activation vector quantization. Precomputed dot products between weight matrix and centroids in the codebook are stored in the lookup tables.
  • Figure 4: Linear projection with both activation and weight vector quantization with 2D lookup tables.
  • Figure 5: Roofline model of Qwen 3 1.7B on the AMD V80 FPGA, comparing memory-based compute with activation VQ against arithmetic-based compute in FP16. Memory-based compute can attain higher throughput for long sequences, but is heavily memory-bound for decoding.
  • ...and 9 more figures
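
The roofline reasoning behind Figure 5 can be sketched with the standard roofline formula, in which attainable throughput is the minimum of peak compute and memory bandwidth times operational intensity. The sketch below is a generic illustration: the peak, bandwidth, and intensity values are hypothetical placeholders, not measurements of the AMD V80 or numbers from the paper.

```python
def attainable_throughput(peak_ops_per_s, mem_bw_bytes_per_s, ops_per_byte):
    """Standard roofline: min(compute roof, bandwidth * operational intensity)."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)

def is_memory_bound(peak_ops_per_s, mem_bw_bytes_per_s, ops_per_byte):
    """True when the bandwidth roof, not the compute roof, limits throughput."""
    return mem_bw_bytes_per_s * ops_per_byte < peak_ops_per_s

# Hypothetical device and workload numbers, for illustration only.
PEAK = 100e12   # ops/s
BW = 800e9      # bytes/s
DECODE_INTENSITY = 2.0     # ops/byte: one token reuses each loaded byte rarely
PREFILL_INTENSITY = 512.0  # ops/byte: a long sequence reuses loaded data heavily

decode = attainable_throughput(PEAK, BW, DECODE_INTENSITY)
prefill = attainable_throughput(PEAK, BW, PREFILL_INTENSITY)
```

With these placeholder values, the low-intensity decoding case sits on the bandwidth roof while the long-sequence case reaches the compute roof, which is the qualitative pattern the caption describes.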