T-MAN: Enabling End-to-End Low-Bit LLM Inference on NPUs via Unified Table Lookup

Jianyu Wei, Qingtao Li, Shijie Cao, Lingxiao Ma, Zixu Hao, Yanyong Zhang, Xiaoyan Hu, Ting Cao

TL;DR

NPUs on mobile devices provide high GEMM throughput but struggle with low-bit LLM inference because of dequantization overhead and data layout mismatches. T-MAN proposes fused two-level LUT dequantization and a concurrency-guided tiling strategy to enable end-to-end on-device inference for both prefill and decoding using a single weight copy. The approach delivers up to 1.4x speedup for prefill, 3.1x for decoding, and up to 84% energy savings compared to strong baselines, across multiple models and devices. This work enables practical, energy-efficient on-device LLM inference and lays out a roadmap for hardware-friendly LUT-based acceleration on NPUs.

Abstract

Large language models (LLMs) are increasingly deployed on customer devices. To support them, current devices are adopting SoCs (systems on chip) with NPUs (neural processing units) installed. Although high performance is expected, LLM inference on NPUs is slower than its CPU counterpart. The reason is that NPUs perform poorly on computations other than GEMM, such as dequantization. Current works either disaggregate the two phases, running prefill on the NPU and decoding on the CPU, or put both on the NPU but with an accuracy loss. To solve this issue, based on the insight that low-bit quantization allows the target computation to be encoded within an acceptably sized table, we propose table lookup to subsume hardware operations that are otherwise unsupported. To realize this, we overcome the conflicting hardware behavior of prefill and decoding to design a unified table layout and tiling through (1) fused two-level table-based dequantization and (2) concurrency-hierarchy-guided tiling. Based on that, we implement the prefill phase as a three-stage pipeline and map the table-lookup-based decoding to the NPU's vector units. Results show 1.4x and 3.1x speedup for prefill and decoding respectively, and 84% energy savings compared to the baseline NPU methods. The code is available at https://github.com/microsoft/T-MAC/tree/main/t-man.
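The insight that a low-bit code has few enough values to fit in a small table can be illustrated with a minimal sketch of table-based dequantization. This is a plain single-level lookup for 4-bit weights, not the paper's fused two-level scheme; the function names, the low-nibble-first packing, and the `scale` and `zero_point` values are assumptions chosen for illustration.

```python
def build_dequant_table(bits, scale, zero_point):
    """Precompute the real value of every possible low-bit code."""
    return [(code - zero_point) * scale for code in range(1 << bits)]


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes, low nibble first (assumed layout)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_by_lookup(packed, table):
    """Dequantize packed 4-bit weights using table lookups only."""
    return [table[code] for code in unpack_nibbles(packed)]


# A 4-bit code has only 16 possible values, so the table has 16 entries.
table = build_dequant_table(bits=4, scale=0.5, zero_point=8)
packed_weights = bytes([0x80, 0xF1])  # codes 0, 8, 1, 15
weights = dequantize_by_lookup(packed_weights, table)
print(weights)
```

The subtract-and-multiply arithmetic runs once per table entry rather than once per weight, which is why the approach suits hardware that is weak at element-wise operations.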

Paper Structure

This paper contains 49 sections, 1 equation, 17 figures, 4 tables.

Figures (17)

  • Figure 1: T-MAN versus current practice. To maintain accuracy, current practice offloads the decoding phase to the CPU and stores two weight copies, one for the NPU and one for the CPU. T-MAN leverages table lookup to run both prefill and decoding on the NPU and keeps only one weight copy.
  • Figure 2: Bit-serial table lookup to implement GEMM. The weight matrix is decomposed into one-bit matrices, which serve as indices into tables of precomputed results.
  • Figure 3: A typical NPU architecture, shown with the Snapdragon 8 NPU. It integrates the matrix core (HMX), vector cores (HVX), scalar units, and the on-chip memory (TCM).
  • Figure 4: Current practices in leveraging NPUs for low-bit LLM inference, illustrated with SOTA frameworks: Qualcomm QNN from industry and llm.npu from academia.
  • Figure 5: Latency breakdown for an mpGEMV of size $4096 \times 4096 \times 1$, comparing NPU and CPU performance. The total latency is segmented into memory loading (MEM), dequantization (DQ), and computation (CMP).
  • ...and 12 more figures
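
The bit-serial table lookup described in the Figure 2 caption can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the paper's NPU kernel: weights are unsigned integers of `bits` bits, activations are grouped four at a time, and the names `build_tables` and `lut_gemv` are hypothetical.

```python
def build_tables(activations, group):
    """For each group of activations, precompute all 2**group subset sums."""
    tables = []
    for start in range(0, len(activations), group):
        chunk = activations[start:start + group]
        table = []
        for index in range(1 << len(chunk)):
            # Bit j of `index` selects whether chunk[j] is part of the sum.
            table.append(sum(a for j, a in enumerate(chunk) if (index >> j) & 1))
        tables.append(table)
    return tables


def lut_gemv(weights, activations, bits, group=4):
    """Compute weights @ activations for unsigned `bits`-bit integer weights."""
    tables = build_tables(activations, group)
    out = []
    for row in weights:
        acc = 0
        for b in range(bits):  # one pass per one-bit weight plane
            for t, table in enumerate(tables):
                chunk = row[t * group:(t + 1) * group]
                index = 0
                for j, w in enumerate(chunk):
                    index |= ((w >> b) & 1) << j  # bits of plane b form the index
                acc += table[index] * (1 << b)  # weight the plane by 2**b
        out.append(acc)
    return out


weights = [[3, 0, 1, 2, 1, 1], [2, 2, 2, 2, 0, 3]]  # 2-bit weights
activations = [1, -2, 3, 4, 5, -6]
result = lut_gemv(weights, activations, bits=2)
reference = [sum(w * a for w, a in zip(row, activations)) for row in weights]
print(result, reference)
```

The tables depend only on the activations, so they are built once and shared by every weight row and every bit plane; the inner loop then contains no multiplications by weight values.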