LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang
TL;DR
The paper tackles the challenge of accelerating low-bit LLM inference on hardware that does not natively support mixed-precision matrix multiplication (mpGEMM). It introduces LUT Tensor Core, a software-hardware co-design with three parts: software-driven LUT precompute fusion and weight reinterpretation to shrink the LUTs, a bit-serial LUT microarchitecture with elongated tiling, and LMMA instructions with a TVM/Roller/Welder-based compilation flow. The approach yields 4–6× improvements in power, performance, and area, and up to 8.2× end-to-end speedups on representative LLMs, while delivering higher compute density and energy efficiency than previous LUT-based solutions. The work demonstrates broad precision flexibility and practical integration with existing inference stacks, paving the way for efficient LUT-based mpGEMM in future hardware and software ecosystems.
Abstract
Large Language Model (LLM) inference has become resource-intensive, prompting a shift toward low-bit model weights to reduce the memory footprint and improve efficiency. Such low-bit LLMs require mixed-precision matrix multiplication (mpGEMM), an important yet underexplored operation that multiplies lower-precision weights with higher-precision activations. Off-the-shelf hardware does not support this operation natively, leading to indirect and therefore inefficient dequantization-based implementations. In this paper, we study the lookup table (LUT)-based approach for mpGEMM and find that a conventional LUT implementation fails to achieve the promised gains. To unlock the full potential of LUT-based mpGEMM, we propose LUT Tensor Core, a software-hardware co-design for low-bit LLM inference. LUT Tensor Core differentiates itself from conventional LUT designs through: 1) software-based optimizations to minimize table precompute overhead and weight reinterpretation to reduce table storage; 2) a LUT-based Tensor Core hardware design with an elongated tiling shape to maximize table reuse and a bit-serial design to support diverse precision combinations in mpGEMM; 3) a new instruction set and compilation optimizations for LUT-based mpGEMM. LUT Tensor Core significantly outperforms existing pure software LUT implementations and achieves a 1.44× improvement in compute density and energy efficiency compared to previous state-of-the-art LUT-based accelerators.
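To make the LUT-based mpGEMM idea concrete, the following is a minimal software sketch, not the paper's implementation. It assumes unsigned integer weights, a plain 0/1 bit encoding, and a hypothetical group size of four activations per table (`GROUP`); the function names are illustrative. For each activation group it precomputes every subset sum once, then processes the weights bit-serially, so each weight bit plane becomes a table lookup instead of a multiplication. It does not model weight reinterpretation, elongated tiling, or the LMMA instructions.

```python
GROUP = 4  # activations covered by one table (illustrative assumption)


def precompute_tables(acts):
    """For each group of GROUP activations, tabulate every subset sum.

    Entry idx of a table holds the sum of group[i] over each bit i set in idx.
    """
    tables = []
    for g in range(0, len(acts), GROUP):
        group = acts[g:g + GROUP]
        table = [0.0] * (1 << GROUP)
        for idx in range(1, 1 << GROUP):
            low = idx & -idx              # lowest set bit of idx
            i = low.bit_length() - 1      # position of that bit
            table[idx] = table[idx ^ low] + group[i]
        tables.append(table)
    return tables


def lut_gemv(weights, acts, bits):
    """Compute y[n] = sum_k weights[n][k] * acts[k] using lookups only.

    weights: rows of unsigned ints below 2**bits (assumed encoding).
    acts: higher-precision activations, length a multiple of GROUP.
    """
    assert len(acts) % GROUP == 0
    tables = precompute_tables(acts)      # built once, reused by every row
    out = []
    for row in weights:
        acc = 0.0
        for b in range(bits):             # bit-serial: one bit plane per pass
            plane = 0.0
            for t, table in enumerate(tables):
                idx = 0
                for i in range(GROUP):
                    # Pack bit b of GROUP consecutive weights into an index.
                    idx |= ((row[t * GROUP + i] >> b) & 1) << i
                plane += table[idx]
            acc += plane * (1 << b)       # weight the plane by 2**b
        out.append(acc)
    return out


def reference_gemv(weights, acts):
    """Direct multiply-accumulate, used only to check the LUT result."""
    return [sum(w * a for w, a in zip(row, acts)) for row in weights]
```

Two points from the abstract show up directly in this sketch. The tables depend only on the activations, so they are shared by every weight row, which is why maximizing table reuse matters. Changing `bits` changes the weight precision without changing the table, which is the appeal of a bit-serial design for supporting diverse precision combinations.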
