DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72

Wanqian Li, Jintao Peng, Zongfei Jing, Tianyu Zhang, Ze Long, Xianjie Qiao, Xiaoming Chen, Dongxu Yang, Kefeng Duan, June Yang

Abstract

Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while distributing MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations: split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user across the 20-100 TPS/user serving range, with an 8K input sequence length and a 1K output sequence length.
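
To make the execution model concrete, the following is a minimal sketch in PyTorch-style Python of how a data-parallel rank might fetch a missing expert from a peer GPU on a side CUDA stream, assuming peer-to-peer access over NVLink. The class ExpertStore, its methods, and the expert-ownership map are illustrative names introduced here, not the TensorRT-LLM implementation.

    # Illustrative sketch only (not TensorRT-LLM code). Assumes PyTorch with CUDA
    # peer-to-peer access between GPUs (e.g. over NVLink) and an ownership map
    # from expert id to the rank that holds its weights.
    import torch

    class ExpertStore:
        def __init__(self, device, local_experts, peer_experts):
            self.device = device              # this rank's GPU, e.g. torch.device("cuda:0")
            self.local = dict(local_experts)  # {expert_id: weight tensor resident on this GPU}
            self.peers = peer_experts         # {rank: {expert_id: weight tensor on that peer's GPU}}
            self.fetch_stream = torch.cuda.Stream(device=device)
            self.inflight = {}                # {expert_id: staging tensor being copied}

        def prefetch(self, expert_id, owner_rank):
            # Start an asynchronous peer-to-peer copy of a missing expert on a
            # side stream; returns immediately and involves no collective.
            if expert_id in self.local or expert_id in self.inflight:
                return
            src = self.peers[owner_rank][expert_id]
            dst = torch.empty_like(src, device=self.device)
            with torch.cuda.stream(self.fetch_stream):
                dst.copy_(src, non_blocking=True)
            self.inflight[expert_id] = dst

        def get(self, expert_id):
            # Return the expert weights; the only wait is on this rank's own
            # fetch stream, so a slow peer never stalls a layer-wise collective.
            if expert_id not in self.local:
                torch.cuda.current_stream(self.device).wait_stream(self.fetch_stream)
                self.local[expert_id] = self.inflight.pop(expert_id)
            return self.local[expert_id]

The design point illustrated here is that the only synchronization is a local wait on the rank's own fetch stream, so remote-weight prefetch overlaps with compute and no layer-wise collective couples the ranks.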

Paper Structure

This paper contains 33 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Synchronization overhead caused by workload imbalance in DEP. (a) Illustration of how request-level and weight-level imbalance are translated into waiting time in DEP. (b) Kernel breakdown quantifying the synchronization overhead caused by imbalance under DEP. Configuration: DeepSeek-R1 on GB200 with input sequence length/output sequence length (ISL/OSL) = 8K/1 and input ratio 0.8, meaning that the input lengths range from 0.8$\times$8K to 8K.
  • Figure 2: Overview of DWDP with DWDP group size 4.
  • Figure 3: Roofline-based preliminary analysis for the DeepSeek-R1 context phase on GB200, comparing DWDP4 against DEP4 at batch size 1. The two subplots separately show the compute-to-prefetch ratio and the DEP-to-DWDP runtime ratio. The dashed line at $y=1$ marks the boundary where prefetch can be fully hidden and where DWDP begins to outperform DEP (a back-of-the-envelope form of this hiding condition is sketched after the figure list).
  • Figure 4: Nsight Systems trace showing many-to-one source-side communication contention in DWDP under max_num_tokens$=16384$ and input sequence lengths ranging from 4K to 8K. Multiple destination ranks concurrently pull remote weights from the same source rank, so the source-side copy engine serializes these requests and exposes compute bubbles.
  • Figure 5: End-to-end Pareto frontier comparison between baseline and DWDP.
  • ...and 2 more figures
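
As a back-of-the-envelope reading of the $y=1$ boundary in Figure 3 (under an assumed roofline model; the symbols $W$, $P$, $S$, and $B$ are introduced here for illustration and are not the paper's notation), prefetch of remote expert weights can be fully hidden when per-layer compute time at least matches the peer-to-peer copy time:

\[
\frac{T_{\text{compute}}}{T_{\text{prefetch}}} \;=\; \frac{W/P}{S/B} \;\ge\; 1
\quad\Longleftrightarrow\quad
W \;\ge\; \frac{P}{B}\,S,
\]

where $W$ is the per-layer compute work in FLOPs, $P$ the achievable compute throughput in FLOP/s, $S$ the bytes of remote expert weights fetched per layer, and $B$ the effective peer-to-peer copy bandwidth in bytes/s. Ratios at or above 1 correspond to the regime where, per the caption, prefetch can be fully hidden.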