Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Xiangyu Li, Chengyu Yin, Weijun Wang, Jianyu Wei, Ting Cao, Yunxin Liu

TL;DR

This work identifies a major inefficiency in LUT-based ultra-low-bit LLM inference: when multiple tokens are processed in parallel, scalar LUT lookups are repeated for every token and fail to fully utilize memory bandwidth. It introduces Vec-LUT, a vector LUT paradigm that shares one unified LUT across tokens and performs a single 1→N lookup per weight index, paired with a token-contiguous tensor layout, tile-based weight packing, and cache-aware streamed lookup. Its two core techniques, Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup, are integrated into llama.cpp and evaluated on 5 edge devices and 3 LLMs, with reported speedups of up to 4.2× at the kernel level and up to 273.5 tokens/s end to end. The results suggest that Vec-LUT can substantially speed up on-device LLM inference on commodity CPUs, reducing the need for specialized accelerators and broadening access to on-device intelligence for edge applications.

Abstract

Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens, and performs a single $1 \rightarrow N$ lookup per index. To realize it efficiently, we further introduce (1) Vector LUT-Centric Tensor Layout, and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\times$. Our implementation is integrated into llama.cpp. The code is available at https://github.com/Cipherxzc/vlut.cpp.
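
To make the contrast between the two lookup paradigms concrete, the following is a minimal pure-Python sketch, not the authors' kernel. It assumes ternary weights grouped four at a time and indexed by a base-3 code; the group size `G` and the helper names (`pattern_index`, `scalar_lut_matmul`, `vector_lut_matmul`, `reference_matmul`) are hypothetical and chosen only for illustration. The scalar version builds one table per token and performs a separate 1→1 lookup for each token, while the vector version builds one shared table whose rows hold the results for all N tokens contiguously, so each weight index needs a single 1→N lookup.

```python
from itertools import product

G = 4  # weights per lookup group (an assumption for this sketch)
PATTERNS = list(product((-1, 0, 1), repeat=G))  # all 3**G ternary groups


def pattern_index(group):
    """Base-3 index of a ternary group; matches its position in PATTERNS."""
    idx = 0
    for w in group:
        idx = idx * 3 + (w + 1)
    return idx


def scalar_lut_matmul(weights, tokens):
    """Scalar LUT: one table per token, so each index costs N separate lookups."""
    out = [[0.0] * len(tokens) for _ in weights]
    for n, act in enumerate(tokens):
        for k in range(0, len(act), G):
            # Table for this token only: one scalar per ternary pattern.
            table = [sum(w * a for w, a in zip(p, act[k:k + G])) for p in PATTERNS]
            for m, row in enumerate(weights):
                out[m][n] += table[pattern_index(row[k:k + G])]
    return out


def vector_lut_matmul(weights, tokens):
    """Vector LUT: one shared table whose rows hold N contiguous token results."""
    out = [[0.0] * len(tokens) for _ in weights]
    for k in range(0, len(tokens[0]), G):
        # Row i holds the partial result of pattern i for every token.
        table = [
            [sum(w * a for w, a in zip(p, act[k:k + G])) for act in tokens]
            for p in PATTERNS
        ]
        for m, row in enumerate(weights):
            looked_up = table[pattern_index(row[k:k + G])]  # single 1->N lookup
            out[m] = [o + r for o, r in zip(out[m], looked_up)]
    return out


def reference_matmul(weights, tokens):
    """Plain multiply-add reference: out[m][n] = sum_k W[m][k] * X[n][k]."""
    return [[sum(w * a for w, a in zip(row, act)) for act in tokens] for row in weights]
```

Both LUT functions return the same matrix as `reference_matmul`; they differ only in how many table reads each weight index triggers. The sketch ignores everything that makes the real kernel fast (SIMD, tiling, cache-aware streaming) and only illustrates the access pattern.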

Paper Structure

This paper contains 60 sections, 2 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Different mpGeMM paradigms for ternary LLM inference. Our vector LUT stores precomputed results from multiple tokens contiguously in the table, and performs efficient $1\rightarrow N$ lookup, instead of $N\times$ repetitive $1\rightarrow 1$ lookup in existing scalar LUT. Fig. 2 and the section on LUT preliminaries further explain the mechanism of LUT.
  • Figure 2: A minimal example of using LUT to calculate $\mathbf{o}=\mathbf{w}\times \mathbf{v}$ with FP16 $\mathbf{v}$ and ternary $\mathbf{w}$ of size 4.
  • Figure 3: mpGeMM latency vs. bits per weight (BPW) on an Intel PC. Vec-LUT exploits lower BPW (i.e., $\le 2$) to reduce latency, while MAD-based llama.cpp gains no speedup from lower BPW.
  • Figure 4: Overview of the Vec-LUT mpGeMM kernel.
  • Figure 5: Mappings among packed weights (bits and decimal), unpacked weights (ternary), and precomputed LUT rows. The LUT row index is equal to the decimal value of packed weights, avoiding extra conversion.
  • ...and 7 more figures
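
The captions of Figures 2 and 5 can be illustrated with a small worked example. The sketch below is an assumption-laden illustration rather than the paper's packing format: it assumes four ternary weights are packed into one integer by base-3 coding, and the helper names (`build_lut`, `pack_ternary`, `unpack_ternary`) are hypothetical. It shows the property stated for Figure 5, namely that the decimal value of the packed weights can be used directly as the LUT row index, so no conversion is needed at lookup time.

```python
from itertools import product


def build_lut(v):
    """Precompute w . v for every ternary w of len(v); row i matches base-3 code i."""
    return [sum(w * x for w, x in zip(p, v)) for p in product((-1, 0, 1), repeat=len(v))]


def pack_ternary(w):
    """Pack ternary weights into one integer by base-3 coding (hypothetical scheme)."""
    code = 0
    for t in w:
        code = code * 3 + (t + 1)  # map -1, 0, 1 to digits 0, 1, 2
    return code


def unpack_ternary(code, size):
    """Inverse of pack_ternary, shown only to make the mapping explicit."""
    digits = []
    for _ in range(size):
        code, d = divmod(code, 3)
        digits.append(d - 1)
    return digits[::-1]


v = [0.5, -1.0, 2.0, 4.0]   # activations (FP16 in the paper, Python floats here)
w = [1, 0, -1, 1]           # ternary weights of size 4
lut = build_lut(v)          # 3**4 = 81 precomputed dot products
code = pack_ternary(w)      # packed value doubles as the LUT row index
o = lut[code]               # one lookup replaces four multiply-adds
```

With four ternary weights there are 81 codes, which fit in a single byte; in this sketch that corresponds to 2 bits per weight.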