Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
Qingyuan Li, Bo Zhang, Liang Ye, Yifan Zhang, Wei Wu, Yerui Sun, Lin Ma, Yuchen Xie
TL;DR
This paper addresses the communication bottleneck of tensor-parallel inference in large language models with Flash Communication, a low-bit activation quantization technique paired with a two-step All-Reduce. A fused CUDA kernel implements the approach, cutting intra-node communication time and lowering time-to-first-token (TTFT) with minimal accuracy impact. In experiments with LLaMA-2/3 models on L40 and A100 GPUs, the method improves TTFT by up to 2x while preserving accuracy across benchmarks. The work offers a practical path to faster, scalable LLM inference by shrinking communication volume and using fewer reduction hops in tensor-parallel setups.
Abstract
The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across multiple accelerators such as the GPUs in a cluster. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method speeds up intra-node communication by more than 3x and reduces the time-to-first-token by 2x, with nearly no sacrifice in model accuracy. Extensive experiments on various up-to-date LLMs demonstrate the effectiveness of our approach.
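To make the idea of a quantized two-step All-Reduce concrete, the following is a minimal single-process sketch in plain Python. It is not the paper's fused CUDA kernel. It assumes that the two steps are a reduce-scatter followed by an all-gather, and that activations are compressed with symmetric per-group INT8 quantization. The function names (`quantize`, `dequantize`, `two_step_all_reduce`), the group size, and the bit width are hypothetical choices made for illustration only.

```python
import random


def quantize(values, group_size=4, bits=8):
    """Symmetric per-group quantization (an assumed scheme).

    Returns integer codes plus one floating-point scale per group.
    """
    qmax = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # Fall back to a scale of 1.0 for an all-zero group.
        scale = max(abs(v) for v in group) / qmax or 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in group)
    return codes, scales


def dequantize(codes, scales, group_size=4):
    """Map integer codes back to floats using the per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def two_step_all_reduce(per_rank, group_size=4, bits=8):
    """Simulate a quantized All-Reduce over a list of per-rank vectors.

    Assumes the vector length is divisible by the number of ranks and
    each chunk is divisible by group_size.
    """
    n = len(per_rank)
    chunk = len(per_rank[0]) // n

    # Step 1 (reduce-scatter): rank j receives chunk j from every rank
    # in low-bit form, dequantizes it, and accumulates the sum.
    # For simplicity the local chunk is quantized too.
    reduced = []
    for j in range(n):
        total = [0.0] * chunk
        for r in range(n):
            piece = per_rank[r][j * chunk:(j + 1) * chunk]
            codes, scales = quantize(piece, group_size, bits)
            for i, v in enumerate(dequantize(codes, scales, group_size)):
                total[i] += v
        reduced.append(total)

    # Step 2 (all-gather): each rank quantizes its reduced chunk and
    # shares it; every rank dequantizes and concatenates the chunks.
    result = []
    for j in range(n):
        codes, scales = quantize(reduced[j], group_size, bits)
        result.extend(dequantize(codes, scales, group_size))
    return result


random.seed(0)
data = [[random.uniform(-1, 1) for _ in range(32)] for _ in range(4)]
exact = [sum(column) for column in zip(*data)]
approx = two_step_all_reduce(data)
print("max abs error:", max(abs(a - e) for a, e in zip(approx, exact)))
```

In this sketch every value crosses the wire as an 8-bit code plus a shared per-group scale rather than as a 16-bit float, which is where the reduction in communication volume would come from. The result is quantized twice, once per step, so the choice of group size and bit width governs the accuracy cost.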
