TP-Aware Dequantization
Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti
TL;DR
The paper reduces latency in distributed LLM inference by combining data-locality enforcement with a TP-aware dequantization strategy. It builds on GPTQ-style grouping, where activation-order reordering introduces memory-access and communication overhead, and mitigates that overhead through offline optimization that yields an optimized group-index mapping and a permutation P. The key innovation, TP-Aware Dequantization, aligns weight and metadata sharding so that the output of the column-TP layer requires no additional AllGather before the next GEMM, which reduces inter-rank communication, particularly in the MLP layers of transformer blocks. On A100 and H100 DGX systems, the method achieves speedups of up to 1.81x for Llama-70B and up to 1.78x for Granite-20B MLP layer problem sizes across a variety of TP settings.
Abstract
In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate a speedup of up to 1.81x over existing methods for Llama-70B and up to 1.78x for IBM WatsonX's Granite-20B MLP layer problem sizes on A100 and H100 NVIDIA DGX Systems for a variety of TP settings.
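The communication-saving idea can be illustrated with a small NumPy sketch. This is not the authors' kernel: it omits quantization, dequantization, and the activation function, and it simulates ranks with a Python loop. The function names, the shapes, and the assumption that a single permutation `perm` is applied offline to both the columns of the first (column-TP) weight and the rows of the second (row-TP) weight are all hypothetical choices made for illustration. Under those assumptions, each rank's local output from the first GEMM is already in the order its shard of the second weight expects, so only the final reduction (standing in for an AllReduce) is needed.

```python
import numpy as np

def mlp_reference(x, w1, w2):
    """Single-device reference: two back-to-back GEMMs."""
    return (x @ w1) @ w2

def tp_aware_mlp(x, w1, w2, perm, tp):
    """Simulated tensor-parallel MLP with an offline-aligned permutation.

    perm is a hypothetical reordering of the hidden dimension (for example,
    one chosen so that the second layer's rows are laid out for locality).
    """
    # Offline step: apply the same permutation to W1's columns and W2's rows.
    w1_p = w1[:, perm]
    w2_p = w2[perm, :]

    # Shard W1 by columns (column-TP) and W2 by rows (row-TP).
    w1_shards = np.split(w1_p, tp, axis=1)
    w2_shards = np.split(w2_p, tp, axis=0)

    # Each simulated rank feeds its local first-layer output directly
    # into its local second-layer shard, with no gather in between.
    partials = [(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]

    # Summing the partial results stands in for the final AllReduce.
    return sum(partials)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))
perm = rng.permutation(16)

out_ref = mlp_reference(x, w1, w2)
out_tp = tp_aware_mlp(x, w1, w2, perm, tp=4)
```

The two results agree up to floating-point rounding because permuting the hidden dimension consistently on both sides leaves the product unchanged. If only `w2` were permuted, the column-sharded intermediate would need to be gathered, reordered, and re-sharded before the second GEMM.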
