FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables

Gunho Park, Hyeokjun Kwon, Jiwoo Kim, Jeongin Bae, Baeseong Park, Dongsoo Lee, Youngjoo Lee

TL;DR

This work tackles the memory and bandwidth bottlenecks of deploying large language models by focusing on weight-only quantization, which requires FP-INT computation. It introduces FIGLUT, a LUT-based FP-INT GEMM accelerator that replaces the traditional multiply-accumulate (MAC) unit with a read-accumulate unit. To enable efficient parallelism, FIGLUT uses a conflict-free flip-flop LUT (FFLUT), together with a decoding scheme and a half-size LUT (hFFLUT). The design supports multiple quantization methods (including BCQ) and mixed precisions on a single hardware configuration, using a 2D systolic array with a weight-stationary dataflow and specialized LUT generation to minimize hardware overhead. Hardware and accuracy evaluations show that, at the same 3-bit weight precision, FIGLUT achieves 59% higher TOPS/W and 20% lower perplexity than state-of-the-art FP-INT accelerators. At the same perplexity, it achieves 98% higher TOPS/W by performing 2.4-bit operations, which indicates strong practical impact for memory-bound LLM inference.

Abstract

Weight-only quantization has emerged as a promising solution to the deployment challenges of large language models (LLMs). However, it necessitates FP-INT operations, which make implementation on general-purpose hardware like GPUs difficult. In this paper, we propose FIGLUT, an efficient look-up table (LUT)-based GEMM accelerator architecture. Instead of performing traditional arithmetic operations, FIGLUT retrieves precomputed values from an LUT based on weight patterns, significantly reducing the computational complexity. We also introduce a novel LUT design that addresses the limitations of conventional memory architectures. To further improve LUT-based operations, we propose a half-size LUT combined with a dedicated decoding and multiplexing unit. FIGLUT efficiently supports different bit precisions and quantization methods using a single fixed hardware configuration. For the same 3-bit weight precision, FIGLUT demonstrates 59% higher TOPS/W and 20% lower perplexity than state-of-the-art accelerator designs. When targeting the same perplexity, FIGLUT achieves 98% higher TOPS/W by performing 2.4-bit operations.
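
The LUT-based idea in the abstract can be illustrated with a small software sketch. This is not the FIGLUT hardware or the authors' code; it only shows the arithmetic under stated assumptions: weights are in a binary-coding (BCQ) form with sign bits in {-1, +1} and per-row scales, activations are processed in groups of `MU` values (a hypothetical group size), and the half-size table relies on the fact that flipping every sign in a pattern negates the partial sum. All function names here are illustrative.

```python
import numpy as np

MU = 4  # hypothetical group size: number of activations covered by one LUT


def build_lut(x_group):
    """Precompute sum_j s_j * x_j for every sign pattern s in {-1,+1}^mu.

    Bit j of the table index selects +x_j (bit = 1) or -x_j (bit = 0).
    """
    mu = len(x_group)
    lut = np.empty(2 ** mu, dtype=np.float64)
    for key in range(2 ** mu):
        signs = np.array([1.0 if (key >> j) & 1 else -1.0 for j in range(mu)])
        lut[key] = signs @ x_group
    return lut


def read_half_lut(half_lut, key, mu):
    """Read from a table that stores only the keys whose top bit is 0.

    For a key whose top bit is 1, the bitwise complement is stored, and
    the stored value is the negation of the value we want.
    """
    mask = (1 << mu) - 1
    if key & (1 << (mu - 1)):
        return -half_lut[key ^ mask]
    return half_lut[key]


def lut_matvec(sign_bits, x, mu=MU, half=False):
    """Compute B @ x, where B = 2 * sign_bits - 1, using reads and adds only."""
    m, n = sign_bits.shape
    assert n % mu == 0, "pad the input so its length is a multiple of mu"
    bit_weights = 1 << np.arange(mu)
    y = np.zeros(m, dtype=np.float64)
    for g in range(0, n, mu):
        lut = build_lut(x[g:g + mu])                 # one table per activation group
        keys = sign_bits[:, g:g + mu] @ bit_weights  # weight pattern -> table index
        if half:
            half_lut = lut[: 2 ** (mu - 1)]
            y += np.array([read_half_lut(half_lut, int(k), mu) for k in keys])
        else:
            y += lut[keys]                           # read-accumulate, no multiplies
    return y


def bcq_lut_matvec(bit_planes, alphas, x, mu=MU, half=False):
    """y = sum_i alphas[i] * (B_i @ x) for q bit planes with per-row scales."""
    return sum(a * lut_matvec(b, x, mu, half) for b, a in zip(bit_planes, alphas))
```

In this sketch one table is shared by every output row that reads the same activation group, which is where the savings over per-weight multiplication come from. The `half=True` path stores half as many entries and pays for it with a complement on the key and a negation on the result, a software analogue of the decoding and multiplexing step the abstract mentions.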

Paper Structure

This paper contains 18 sections, 5 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Extension of binary-coding quantization to support both non-uniform and uniform quantization formats, achieved by including an offset term ($q=3$).
  • Figure 2: (a) Best case and (b) worst case of shared-memory access with 4 banks and 4 threads.
  • Figure 3: Illustration of look-up table based FP-INT GEMM.
  • Figure 4: Overall architecture of FIGLUT.
  • Figure 5: Illustration of the input and weight tile fetching sequence in (a) a systolic array accelerator for INT weight and (b) FIGLUT for BCQ weight. The arrows indicate the order in which the tiles are processed.
  • ...and 11 more figures
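
The caption of Figure 1 says that an offset term lets binary-coding quantization cover uniform formats as well. A minimal sketch of why this works is given below, assuming the usual uniform form `w = scale * (code - zero_point)` with a `q`-bit integer code. Writing each code bit as `(b + 1) / 2` with `b` in {-1, +1} gives `w = sum_i alpha_i * b_i + z`, with `alpha_i = scale * 2**(i - 1)` and `z = scale * ((2**q - 1) / 2 - zero_point)`. The function name and parameterization are illustrative; the paper's own notation may differ.

```python
import numpy as np


def uniform_to_bcq(codes, scale, zero_point, q):
    """Rewrite w = scale * (code - zero_point) as sum_i alpha_i * b_i + z.

    Returns sign bits b_i in {-1,+1} (least significant bit first),
    the per-bit scales alpha_i, and the scalar offset z.
    """
    codes = np.asarray(codes)
    bits01 = (codes[..., None] >> np.arange(q)) & 1   # binary digits of each code
    signs = 2 * bits01 - 1                            # map {0,1} -> {-1,+1}
    alphas = scale * 2.0 ** (np.arange(q) - 1)        # scale * 2^(i-1)
    offset = scale * ((2 ** q - 1) / 2 - zero_point)  # absorbs the constant term
    return signs, alphas, offset


# Example with q = 3, matching the bit width used in Figure 1.
signs, alphas, z = uniform_to_bcq(np.arange(8), scale=0.1, zero_point=3, q=3)
reconstructed = signs @ alphas + z
```

Without the offset `z`, the representable values are always symmetric around zero, so an asymmetric uniform grid could not be expressed exactly.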