FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song

TL;DR

This work tackles the memory-bound bottleneck of deploying large language models by enabling efficient six-bit (FP6) quantization with unified Tensor Core support. The authors introduce TC-FPx, a full-stack GPU kernel design that fuses FP6 weight de-quantization with matrix multiplication, and they integrate it into DeepSpeed to deliver FP6-LLM for end-to-end quantized LLM inference. Key innovations include ahead-of-time bit-level weight pre-packing, a SIMT-efficient runtime for fast de-quantization, and a slice-based software pipeline that overlaps memory access, de-quantization, and computation. Empirical results show substantial throughput gains: for LLaMA-70b, FP6-LLM on a single GPU achieves 1.69×–2.65× higher normalized inference throughput than the FP16 baseline, and OPT-30b sees 1.72×–4.05× improvements. Overall, the paper offers a practical path to deploying FP6-quantized LLMs with a smaller memory footprint and lower end-to-end latency.
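The single-GPU claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: it assumes a nominal 70e9 parameters for LLaMA-70b and an 80 GB device as the memory budget (neither figure is stated in this summary), and it counts weight storage only, ignoring the KV cache, activations, and quantization scales.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: "70b" is read as a nominal 70e9 parameters.
NUM_PARAMS = 70e9
# Assumption: an 80 GB device, used here only as an illustrative budget.
GPU_MEMORY_GB = 80.0

fp16_gb = weight_footprint_gb(NUM_PARAMS, 16)
fp6_gb = weight_footprint_gb(NUM_PARAMS, 6)

print(f"FP16 weights: {fp16_gb:.1f} GB, fits in budget: {fp16_gb < GPU_MEMORY_GB}")
print(f"FP6 weights:  {fp6_gb:.1f} GB, fits in budget: {fp6_gb < GPU_MEMORY_GB}")
```

Under these assumptions the FP16 weights alone exceed the budget, while the FP6 weights leave room for runtime state.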

Abstract

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendly memory access of model weights with irregular bit-width and (2) high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for floating-point weights of various quantization bit-widths. We integrate the TC-FPx kernel into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved. Experiments show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69×–2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_llm.
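To make the de-quantization step concrete, here is a minimal scalar reference decoder for one 6-bit code. It assumes a 1-3-2 sign/exponent/mantissa layout (often written E3M2) with exponent bias 3 and no Inf/NaN encodings. The text above does not spell out the bit layout, so treat both the layout and the function name `decode_fp6_e3m2` as hypothetical. This is not the TC-FPx runtime, which operates on packed registers in parallel.

```python
# Assumption: 1 sign bit, 3 exponent bits, 2 mantissa bits, bias 3, no Inf/NaN.
EXPONENT_BIAS = 3


def decode_fp6_e3m2(code: int) -> float:
    """Decode a single 6-bit code into a Python float (hypothetical helper)."""
    if not 0 <= code < 64:
        raise ValueError("code must fit in 6 bits")
    sign = -1.0 if (code >> 5) & 1 else 1.0
    exponent = (code >> 2) & 0b111
    mantissa = code & 0b11
    if exponent == 0:
        # Subnormal: no implicit leading one, fixed exponent of 1 - bias.
        return sign * (mantissa / 4) * 2.0 ** (1 - EXPONENT_BIAS)
    # Normal: implicit leading one.
    return sign * (1 + mantissa / 4) * 2.0 ** (exponent - EXPONENT_BIAS)


# Enumerate the non-negative half of the format.
positive_values = [decode_fp6_e3m2(c) for c in range(32)]
print(positive_values)
```

Because the format has only 64 codes, a reference decoder like this is mainly useful for checking a faster bit-level implementation against every possible input.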

Paper Structure

This paper contains 47 sections, 4 equations, 14 figures, 2 tables, 1 algorithm.

Figures (14)

  • Figure 1: Performance of a linear layer within the LLaMA-65b model. The shapes of the weight/activation matrices are (8192, 22016) and (22016, Batch Size).
  • Figure 2: Two different methods to support weight-only WxA16 quantization during LLM inference. (Left) Dual kernels. (Right) Unified kernel.
  • Figure 3: Memory Access of X-bit Weights for each Thread.
  • Figure 4: Design Overview.
  • Figure 5: Ahead-of-time Bit-level Weight Pre-packing.
  • ...and 9 more figures
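
The captions of Figures 3 and 5 concern how X-bit weights are laid out in and read from memory. The sketch below shows why 6-bit weights are awkward: four codes occupy exactly three bytes, so individual codes straddle byte boundaries. It uses a naive dense big-endian packing chosen purely for illustration. The helper names `pack_6bit` and `unpack_6bit` are hypothetical, and this is not the paper's pre-packing layout.

```python
def pack_6bit(codes: list[int]) -> bytes:
    """Densely pack 6-bit codes, four codes per three bytes (illustrative layout)."""
    if len(codes) % 4:
        raise ValueError("number of codes must be a multiple of 4")
    if any(not 0 <= c < 64 for c in codes):
        raise ValueError("every code must fit in 6 bits")
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        word = (a << 18) | (b << 12) | (c << 6) | d  # one 24-bit group
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_6bit(packed: bytes) -> list[int]:
    """Invert pack_6bit."""
    if len(packed) % 3:
        raise ValueError("packed length must be a multiple of 3")
    codes = []
    for i in range(0, len(packed), 3):
        word = int.from_bytes(packed[i:i + 3], "big")
        codes.extend((word >> shift) & 0x3F for shift in (18, 12, 6, 0))
    return codes


codes = [1, 2, 3, 63]
packed = pack_6bit(codes)
print(packed.hex(), unpack_6bit(packed))
```

In this layout the second code occupies the low two bits of byte 0 and the high four bits of byte 1, so a reader that fetches whole bytes or words cannot extract it with a single aligned load and mask.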