Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms

Yaozheng Zhang, Wei Wang, Jie Kong, Jiehan Zhou, Huanqing Cui

TL;DR

The paper tackles the inefficiency of 4-bit GPTQ LLM inference on heterogeneous accelerators by introducing Opt4GPTQ, a set of platform-level optimizations integrated with the vLLM serving system. It combines three techniques (SMB-Opt, VML-Opt, and ILA-Opt) to reduce memory bottlenecks, accelerate data loading, and exploit the accelerator's hardware-native vector instructions. Experimental results on the HYGON DCU Z100 across six GPTQ-quantized models show up to 84.42% throughput improvement and up to 51.35% latency reduction, with accuracy preserved within 1 percentage point. The work highlights the importance of platform-specific kernel and memory-system optimizations for deploying efficient LLM inference on emerging heterogeneous hardware and provides deployment experience and methodologies for future platforms.

Abstract

The increasing adoption of large language models (LLMs) on heterogeneous computing platforms poses significant challenges for achieving high inference efficiency. To address the low inference efficiency of LLMs across diverse heterogeneous platforms, this paper proposes a practical optimization method, Opt4GPTQ, designed for 4-bit GPTQ-quantized LLM inference on heterogeneous AI accelerators. Built upon the vLLM serving system, Opt4GPTQ integrates three platform-level optimization strategies: Shared Memory Buffering optimization (SMB-Opt), which caches data in shared memory and employs single-threaded writes; Vectorized Memory Loading optimization (VML-Opt), which utilizes vectorized memory operations for efficient data loading; and Inline Assembly optimization (ILA-Opt), which directly leverages hardware-native vector half-precision addition and fused multiply-accumulate instructions for efficient execution. Experimental results show that Opt4GPTQ effectively improves inference performance across different models, achieving up to 84.42% throughput improvement and up to 51.35% latency reduction. This work highlights the critical role of platform-level engineering optimizations in enabling efficient LLM inference on emerging heterogeneous AI acceleration architectures and provides valuable deployment experience and methodologies for future heterogeneous platform adaptation.
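
The two headline percentages are measured on different scales, so they are easier to compare as multiplicative factors. The short Python sketch below does only that arithmetic. The function names are illustrative and not taken from the paper, and it assumes "throughput improvement" means relative gain over the baseline and "latency reduction" means relative drop from the baseline.

```python
def throughput_factor(improvement_pct: float) -> float:
    """Convert a relative throughput gain (in %) into a multiplier over baseline."""
    return 1.0 + improvement_pct / 100.0


def latency_speedup(reduction_pct: float) -> float:
    """Convert a relative latency reduction (in %) into an equivalent speedup factor."""
    remaining_fraction = 1.0 - reduction_pct / 100.0  # share of baseline latency left
    return 1.0 / remaining_fraction


# Peak figures quoted in the abstract (both are "up to" values).
best_throughput = throughput_factor(84.42)
best_latency = latency_speedup(51.35)

print(f"throughput: {best_throughput:.4f}x baseline")
print(f"latency:    {best_latency:.4f}x faster than baseline")
```

Under these assumptions the peak throughput figure corresponds to roughly 1.84x the baseline, and the peak latency figure to roughly a 2.06x speedup. Both are reported as "up to" values, so they need not come from the same model or configuration.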

Paper Structure

This paper contains 21 sections, 10 equations, 3 figures, 2 tables, 3 algorithms.

Figures (3)

  • Figure 1: Atomic operation optimization
  • Figure 2: Inference throughput comparison of vLLM across different models before and after applying SMB-Opt, VML-Opt, ILA-Opt, and Opt4GPTQ.
  • Figure 3: Inference latency comparison of vLLM across different models before and after applying SMB-Opt, VML-Opt, ILA-Opt, and Opt4GPTQ.