TP-Aware Dequantization

Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti

TL;DR

The paper targets latency in distributed LLM inference by combining data-locality enforcement with a TP-aware dequantization strategy. It builds on GPTQ-style grouping and mitigates the memory-access and communication overhead introduced by activation-order reordering: an offline optimization step produces an optimized group-index mapping and a permutation P. The key innovation, TP-Aware Dequantization, reduces inter-rank communication by aligning weight and metadata sharding so that the output of the column-TP layer needs no additional AllGather before the next GEMM, particularly in MLP transformer blocks. On MLP layer problem sizes for Llama-70B and Granite-20B, run on A100 and H100 DGX systems, the method achieves speedups of up to 1.81x and 1.78x respectively over existing methods.
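The offline reordering mentioned above can be illustrated with a small sketch. It assumes that the permutation P is obtained by a stable argsort of the GPTQ group-index array, so that rows sharing a quantization group become contiguous. The names `reorder` and `metadata_loads`, the toy `g_idx` values, and the one-entry-cache cost model are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def reorder(g_idx):
    """Return (perm, g_idx_sorted): a stable argsort of the group indices
    and the group-index array after applying that permutation."""
    perm = sorted(range(len(g_idx)), key=lambda i: g_idx[i])
    g_idx_sorted = [g_idx[i] for i in perm]
    return perm, g_idx_sorted


def metadata_loads(g_idx):
    """Toy cost model (an assumption): walking the rows in order, count one
    metadata fetch (scales/zeros) each time the group differs from the
    previous row's group."""
    loads = 0
    prev = None
    for g in g_idx:
        if g != prev:
            loads += 1
            prev = g
    return loads


# Hypothetical group indices as an activation-order quantizer might leave them.
g_idx = [0, 2, 1, 0, 2, 1, 0, 1]
perm, g_idx_sorted = reorder(g_idx)

print(perm)                          # row order after reordering
print(g_idx_sorted)                  # groups are now contiguous
print(metadata_loads(g_idx))         # fetches with the unordered layout
print(metadata_loads(g_idx_sorted))  # fetches after reordering
```

Under this toy model, the sorted layout needs one metadata fetch per group instead of one per row, which is the data-locality effect the summary refers to.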

Abstract

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate up to a 1.81x speedup over existing methods for Llama-70B and up to a 1.78x speedup for IBM WatsonX's Granite-20B MLP layer problem sizes on A100 and H100 NVIDIA DGX Systems for a variety of TP settings.
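The communication saving can be checked numerically with a minimal sketch. It assumes a two-layer MLP with no activation function, integer weights so that equality is exact, and TP ranks simulated by a loop. Applying the same permutation offline to the columns of the first weight matrix and the rows of the second lets each rank multiply its column shard straight into its row shard, so only the final sum (standing in for an AllReduce) crosses ranks. All function names here are hypothetical and dequantization itself is omitted.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def take_cols(m, idx):
    return [[row[j] for j in idx] for row in m]


def take_rows(m, idx):
    return [m[i] for i in idx]


def mlp_reference(x, w1, w2):
    """Single-device result: (x @ w1) @ w2."""
    return matmul(matmul(x, w1), w2)


def mlp_tp_aware(x, w1, w2, perm, tp_size):
    """Simulated TP MLP with the permutation folded into both layers offline."""
    # Offline: permute columns of w1 and rows of w2 with the same perm.
    w1_p = take_cols(w1, perm)
    w2_p = take_rows(w2, perm)
    hidden = len(perm)
    assert hidden % tp_size == 0, "hidden size must divide evenly across ranks"
    step = hidden // tp_size
    out = None
    for rank in range(tp_size):
        idx = range(rank * step, (rank + 1) * step)
        w1_shard = take_cols(w1_p, idx)  # column-TP shard of layer 1
        w2_shard = take_rows(w2_p, idx)  # row-TP shard of layer 2
        # The local output of layer 1 feeds layer 2 directly: no AllGather.
        partial = matmul(matmul(x, w1_shard), w2_shard)
        # Summing the partial results stands in for the final AllReduce.
        if out is None:
            out = partial
        else:
            out = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out, partial)]
    return out


def rand_matrix(rows, cols):
    return [[random.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
x, w1, w2 = rand_matrix(2, 4), rand_matrix(4, 8), rand_matrix(8, 4)
perm = random.sample(range(8), 8)  # stand-in for the offline permutation P

print(mlp_tp_aware(x, w1, w2, perm, tp_size=2) == mlp_reference(x, w1, w2))
```

An elementwise activation between the two layers would not change the argument, since it commutes with a column permutation; gated MLP variants are outside the scope of this sketch.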

Paper Structure

This paper contains 18 sections, 3 equations, 8 figures, 28 tables, 3 algorithms.

Figures (8)

  • Figure 1: Naive Load with Activation Order Flag
  • Figure 2: Optimized Load with Activation Order Flag
  • Figure 3: Transformer Block
  • Figure 4: TP-Aware Model Parallelism
  • Figure 5: Latency Difference for Llama-70B, A100
  • ...and 3 more figures