FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables
Gunho Park, Hyeokjun Kwon, Jiwoo Kim, Jeongin Bae, Baeseong Park, Dongsoo Lee, Youngjoo Lee
TL;DR
This work addresses the memory and bandwidth bottlenecks of deploying large language models through weight-only quantization, which requires FP-INT computation (floating-point activations with integer weights). It introduces FIGLUT, a LUT-based FP-INT GEMM accelerator that replaces the traditional multiply-accumulate (MAC) unit with a read-accumulate unit. To enable efficient parallelism, it uses a conflict-free flip-flop-based LUT (FFLUT), together with a decoding scheme and a half-size variant (hFFLUT). The design supports multiple quantization methods (including BCQ) and mixed precisions on a single hardware configuration, using a 2D systolic array with a weight-stationary dataflow and specialized LUT generation to minimize hardware overhead. In hardware and accuracy evaluations against state-of-the-art FP-INT accelerators, FIGLUT achieves 59% higher TOPS/W and 20% lower perplexity at the same 3-bit weight precision, and 98% higher TOPS/W at matched perplexity by running 2.4-bit operations, which suggests strong practical value for memory-bound LLM inference.
Abstract
Weight-only quantization has emerged as a promising solution to the deployment challenges of large language models (LLMs). However, it necessitates FP-INT operations, which make implementation on general-purpose hardware like GPUs difficult. In this paper, we propose FIGLUT, an efficient look-up table (LUT)-based GEMM accelerator architecture. Instead of performing traditional arithmetic operations, FIGLUT retrieves precomputed values from an LUT based on weight patterns, significantly reducing the computational complexity. We also introduce a novel LUT design that addresses the limitations of conventional memory architectures. To further improve LUT-based operations, we propose a half-size LUT combined with a dedicated decoding and multiplexing unit. FIGLUT efficiently supports different bit precisions and quantization methods using a single fixed hardware configuration. For the same 3-bit weight precision, FIGLUT demonstrates 59% higher TOPS/W and 20% lower perplexity than state-of-the-art accelerator designs. When targeting the same perplexity, FIGLUT achieves 98% higher TOPS/W by performing 2.4-bit operations.
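The LUT-based read-accumulate idea described in the abstract can be illustrated with a minimal software sketch. It is not the FIGLUT hardware design. It assumes binary-coded weights in {-1, +1} with a single bit plane and no scale factors, a hypothetical group size MU of 4 activations per table, and invented function names. The halving step assumes that complementary weight patterns produce negated sums, which is one plausible way to realize a half-size LUT; the abstract does not spell out the exact decoding scheme.

```python
MU = 4  # hypothetical group size: activations covered by one table


def build_lut(acts):
    """Return all 2**n signed sums of acts.

    Bit i of the key set means +acts[i]; bit i clear means -acts[i].
    """
    n = len(acts)
    lut = []
    for key in range(1 << n):
        s = 0.0
        for i, a in enumerate(acts):
            s += a if (key >> i) & 1 else -a
        lut.append(s)
    return lut


def build_half_lut(acts):
    """Keep only the entries whose top key bit is 0 (half the table)."""
    return build_lut(acts)[: 1 << (len(acts) - 1)]


def read_half(half, key, n):
    """Decode a full-width key against the half-size table.

    Flipping every sign negates the sum, so a key with its top bit set
    is served by the complemented key with the result negated.
    """
    if key >> (n - 1):
        return -half[key ^ ((1 << n) - 1)]
    return half[key]


def lut_gemv(acts, sign_rows, mu=MU):
    """Multiply a matrix of +1/-1 weights by acts using table reads only."""
    # Tables depend only on the activations, so build them once
    # and share them across every weight row.
    groups = [acts[g:g + mu] for g in range(0, len(acts), mu)]
    tables = [build_half_lut(chunk) for chunk in groups]
    out = []
    for row in sign_rows:
        total = 0.0
        for gi, chunk in enumerate(groups):
            signs = row[gi * mu: gi * mu + len(chunk)]
            key = sum(1 << i for i, s in enumerate(signs) if s > 0)
            total += read_half(tables[gi], key, len(chunk))  # read-accumulate
        out.append(total)
    return out


acts = [0.5, -1.5, 2.0, 0.25, 1.0, -0.75, 3.0, 0.125]
rows = [[1, -1, 1, 1, -1, -1, 1, -1],
        [-1, -1, -1, -1, 1, 1, 1, 1]]
result = lut_gemv(acts, rows)
reference = [sum(a * s for a, s in zip(acts, row)) for row in rows]
```

In this sketch the inner loop performs no multiplications: each group of MU weights becomes a table index, and the per-row work is one table read plus one addition per group. A multi-bit weight would repeat the lookup once per bit plane and combine the results with the quantizer's scale factors.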
