LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference

Zhiwen Mo; Lei Wang; Jianyu Wei; Zhichen Zeng; Shijie Cao; Lingxiao Ma; Naifeng Jing; Ting Cao; Jilong Xue; Fan Yang; Mao Yang

TL;DR

The paper tackles the challenge of accelerating low-bit LLM inference, where mixed-precision matrix multiplication (mpGEMM) is not natively supported by existing hardware. It introduces LUT Tensor Core, a software-hardware co-design that combines three elements: software-driven LUT precompute fusion and weight reinterpretation to shrink LUTs, a bit-serial LUT microarchitecture with elongated tiling, and LMMA instructions with a TVM/Roller/Welder-based compilation flow. The approach yields 4–6× improvements in power, performance, and area, and up to 8.2× end-to-end speedups on representative LLMs, while delivering higher compute density and energy efficiency than previous LUT-based solutions. The work demonstrates broad precision flexibility and practical integration with existing inference stacks, paving the way for efficient LUT-based mpGEMM in future hardware and software ecosystems.

Abstract

Large Language Model (LLM) inference has become increasingly resource-intensive, prompting a shift toward low-bit model weights to reduce the memory footprint and improve efficiency. Such low-bit LLMs require mixed-precision matrix multiplication (mpGEMM), an important yet underexplored operation that multiplies lower-precision weights with higher-precision activations. Off-the-shelf hardware does not support this operation natively, leading to indirect and therefore inefficient dequantization-based implementations. In this paper, we study the lookup table (LUT)-based approach for mpGEMM and find that a conventional LUT implementation fails to achieve the promised gains. To unlock the full potential of LUT-based mpGEMM, we propose LUT Tensor Core, a software-hardware co-design for low-bit LLM inference. LUT Tensor Core differentiates itself from conventional LUT designs through: 1) software-based optimizations to minimize table precompute overhead and weight reinterpretation to reduce table storage; 2) a LUT-based Tensor Core hardware design with an elongated tiling shape to maximize table reuse and a bit-serial design to support diverse precision combinations in mpGEMM; 3) a new instruction set and compilation optimizations for LUT-based mpGEMM. LUT Tensor Core significantly outperforms existing pure software LUT implementations and achieves a 1.44$\times$ improvement in compute density and energy efficiency compared to previous state-of-the-art LUT-based accelerators.
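The abstract names two ideas that are easier to follow with a small worked example: weight reinterpretation to reduce table storage, and a bit-serial design for multi-bit weights. The NumPy sketch below is one plausible reading of them, not the paper's implementation. It assumes unsigned integer weights, a group size of 4, and a remapping of each weight bit b in {0, 1} to a sign s = 2b - 1, which makes the table odd-symmetric so that only half of its entries need to be stored. The function names `half_table`, `signed_lookup`, and `bit_serial_dot` are hypothetical.

```python
import numpy as np

GROUP = 4                   # activations covered by one table (assumed)
HALF = 2 ** (GROUP - 1)     # entries kept after exploiting symmetry

def half_table(act_group):
    """Sums of +/- activations with the last sign fixed to +1 (8 entries, not 16)."""
    table = np.zeros(HALF, dtype=np.float32)
    for idx in range(HALF):
        acc = np.float32(act_group[GROUP - 1])       # last activation always +1
        for j in range(GROUP - 1):
            sign = 1.0 if (idx >> j) & 1 else -1.0   # bit j picks the sign of a_j
            acc += np.float32(sign) * np.float32(act_group[j])
        table[idx] = acc
    return table

def signed_lookup(table, bits):
    """Return sum_j (2*bits[j] - 1) * a_j using the half-size table."""
    idx = sum(int(b) << j for j, b in enumerate(bits))
    if idx & HALF:                                   # top sign is +1: stored directly
        return table[idx & (HALF - 1)]
    return -table[(~idx) & (HALF - 1)]               # top sign is -1: flip bits, negate

def bit_serial_dot(acts, weights, n_bits):
    """Dot product with unsigned n_bits-wide weights, one bit plane per pass."""
    acts32 = acts.astype(np.float32)
    act_sum = acts32.sum()                           # offset from b = (s + 1) / 2
    starts = range(0, len(acts32), GROUP)
    tables = [half_table(acts32[g:g + GROUP]) for g in starts]
    total = 0.0
    for bit in range(n_bits):
        plane = (weights >> bit) & 1                 # current bit of every weight
        signed = sum(signed_lookup(tables[g // GROUP], plane[g:g + GROUP])
                     for g in starts)
        # dot(a, b) = (dot(a, s) + sum(a)) / 2, scaled by the bit's weight 2**bit
        total += (1 << bit) * 0.5 * (signed + act_sum)
    return float(total)

rng = np.random.default_rng(0)
acts = rng.standard_normal(8).astype(np.float16)     # FP16 activations
weights = rng.integers(0, 16, size=8)                # unsigned 4-bit weights (assumed)
reference = float(np.dot(acts.astype(np.float32), weights.astype(np.float32)))
result = bit_serial_dot(acts, weights, n_bits=4)
```

In this sketch the same half-size tables serve every bit plane, so supporting wider weights only adds lookup passes rather than larger tables.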

Paper Structure (39 sections, 10 equations, 19 figures, 5 tables)

Introduction
Background and Motivation
LLM Inference and Low-Bit Quantization
LUT-based mpGEMM for Low-Bit LLM
Gaps in Current LUT-based Solutions
LUT Tensor Core Design
Software-based Table Optimization
Precomputing lookup table with DFG transformation and operator fusion
Reinterpreting weight for table symmetrization
Table quantization
LUT-based Tensor Core Microarchitecture
Simplified LUT unit design with bit-serial
Elongated LUT tiling
Instruction and Compilation
LUT-based MMA instructions
...and 24 more sections

Figures (19)

Figure 1: Decoder-only transformer blocks in LLMs. The primary computations are GEMM operations (or mpGEMM operations with weight quantization).
Figure 2: (a) GEMM, (b) Indirect mpGEMM with dequantization, (c) Direct mpGEMM for low-bit LLM inference.
Figure 3: A naive LUT-based mpGEMM tile example of FP16 activations and INT1 weights. With the precomputed table, a table lookup can replace a dot product of 4-element vectors.
Figure 4: mpGEMM kernel performance with shapes M0-M3 extracted from LLAMA2-70B. $W_{INT4}A_{FP16}$ denotes INT4 weights and FP16 activations. LUT-based software kernels (LUT-GEMM) underperform dequantization-based kernels (CUTLASS) on the A100 GPU.
Figure 5: Conventional LUT hardware in three steps. Table precomputation and storage introduce heavy overhead.
...and 14 more figures
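The naive LUT-based scheme described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's kernel: the 1-bit weights are taken to encode {0, 1}, the reduction dimension is a multiple of 4, and the function names `precompute_table` and `lut_matvec` are hypothetical. For each group of 4 activations, all 16 possible partial sums are precomputed once, and every weight row then replaces its 4-element dot product with a single lookup.

```python
import numpy as np

GROUP = 4  # activations per table, matching the 4-element example

def precompute_table(act_group):
    """All 2**GROUP partial sums; entry i adds the activations whose bit is set in i."""
    table = np.zeros(2 ** GROUP, dtype=np.float32)
    for idx in range(2 ** GROUP):
        for j in range(GROUP):
            if (idx >> j) & 1:
                table[idx] += np.float32(act_group[j])
    return table

def lut_matvec(acts, weight_bits):
    """Compute weight_bits @ acts for 0/1 weights using table lookups."""
    n_rows, k = weight_bits.shape
    out = np.zeros(n_rows, dtype=np.float32)
    powers = 1 << np.arange(GROUP)                   # bit position of each weight
    for g in range(0, k, GROUP):
        table = precompute_table(acts[g:g + GROUP])  # built once per group
        idx = weight_bits[:, g:g + GROUP] @ powers   # one table index per row
        out += table[idx]                            # table reused by every row
    return out

rng = np.random.default_rng(0)
acts = rng.standard_normal(8).astype(np.float16)     # FP16 activations
weight_bits = rng.integers(0, 2, size=(6, 8))        # INT1 weights as 0/1 (assumed)
reference = weight_bits.astype(np.float32) @ acts.astype(np.float32)
result = lut_matvec(acts, weight_bits)
```

The table cost is paid once per activation group and amortized over all weight rows, which is why table precomputation and reuse matter for whether a LUT design pays off.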
