SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

Han-Byul Kim, Duc Hoang, Arnav Kundu, Mohammad Samragh, Minsik Cho

TL;DR

SPD addresses the communication bottleneck in tensor-parallel LLM inference by selectively removing sync-points after self-attention. It introduces a block design that lets execution proceed without that communication, and a block-wise sensitivity strategy that classifies blocks into ISB, SB, and ESB, applying zero-shot dropping, distillation, and head grouping to recover accuracy. Empirical results on LLaMA2 and OPT show up to ~20% end-to-end latency reduction with minimal accuracy loss on 8 GPUs under both low-bandwidth (LBW) and high-bandwidth (HBW) interconnect settings. The work offers practical methods to apply SPD on top of existing tensor parallelism while maintaining model quality.

Abstract

With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism make it difficult to achieve scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
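To make the term "sync-point" concrete, the following is a minimal pure-Python sketch of the synchronization that standard tensor parallelism places on the attention output. It is not the paper's block design: the sizes, the random values, and the helper names (`matmul`, `add`) are assumptions chosen for illustration. It only shows that each device's row-parallel output projection yields a partial sum, that an all-reduce (sum) recovers the full result, and that a device skipping the sync would continue with a different tensor.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

random.seed(0)
tokens, hidden, devices = 3, 4, 2   # illustrative sizes (assumption)
shard = hidden // devices

# Attention output before the output projection, and the projection weight.
attn = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(tokens)]
w_out = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(hidden)]

# Row-parallel split: device i holds a column slice of the activations
# (its own heads) and the matching row slice of the projection weight.
partials = []
for i in range(devices):
    lo, hi = i * shard, (i + 1) * shard
    attn_i = [row[lo:hi] for row in attn]
    w_i = w_out[lo:hi]
    partials.append(matmul(attn_i, w_i))

# Sync point: an all-reduce (sum) gives every device the full projection.
synced = partials[0]
for p in partials[1:]:
    synced = add(synced, p)

full = matmul(attn, w_out)
max_err = max(abs(x - y) for rs, rf in zip(synced, full) for x, y in zip(rs, rf))

# Without the sync, device 0 would continue with only its own partial sum.
gap = max(abs(x - y) for rp, rf in zip(partials[0], full) for x, y in zip(rp, rf))

print(f"all-reduced vs. full projection, max abs error: {max_err:.2e}")
print(f"device-0 partial vs. full projection, max abs gap: {gap:.2e}")
```

The nonzero gap is the approximation error that dropping a sync-point introduces at that block, which is why the abstract pairs the drop with a dedicated block design and sensitivity-dependent strategies rather than removing every sync unconditionally.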

Paper Structure

This paper contains 24 sections, 3 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Tensor parallelism applied to a transformer decoder block (2-GPU distributed inference case).
  • Figure 2: Data transfer latency of LLaMA2-70B distributed inference with SPD under different system settings of an NVIDIA A100-80G GPU node. 'HBW' denotes the high-bandwidth setting and 'LBW' the low-bandwidth setting for the device interconnect. The input uses a batch size of 1 and a sequence length of 128.
  • Figure 3: Decoder block structure with sync-point drop (2-GPU distributed inference case). '$W_i$' and '$b$' represent the weight and bias of the linear layer on each device ($i$). '$X$', '$Y_i$', '$Z_i$' and '$P_i$' denote the hidden representations on each device ($i$) at the points marked '$\bullet$' in the figure.
  • Figure 4: Sync sensitivity identification of a decoder block (measuring the sensitivity of $i$-th block).
  • Figure 5: SPD block for the case of 8 heads on 4 GPUs in parallel, with a given head subset ($A_i$) and matching combination ($MC$).
  • ...and 4 more figures