MultiPath Transfer Engine: Breaking GPU and Host-Memory Bandwidth Bottlenecks in LLM Services

Lingfeng Tang, Daoping Zhang, Junjie Chen, Peihao Huang, Feng Jin, Chengguang Xu, Yuxin Chen, Feiqiang Sun, Guo Chen

TL;DR

The paper tackles the PCIe bandwidth bottleneck in LLM serving by introducing MMA, a CUDA-level framework that enables multipath memory access between host memory and GPUs using dummy tasks, a synchronization layer, and a multipath transfer engine. It demonstrates transparent deployment via LD_PRELOAD and achieves a peak intra-server bandwidth of 245 GB/s, a 4.62× improvement over single-path baselines, along with TTFT reductions of 1.14×–2.38× and model-switching latency reductions of 1.12×–2.48×. Key contributions include the design and implementation of Transfer Task Interceptor, Sync Engine, and a three-part Multipath Transfer Engine with dynamic chunking, congestion-aware path selection, and dual-pipeline relay, validated on an eight-GPU system. The work shows substantial practical impact for large-context LLM serving and dynamic model switching, highlighting potential for hardware-software co-design to enable native intra-server multipath scheduling.

Abstract

The limited bandwidth of PCIe has emerged as the critical bottleneck for large language model (LLM) workloads such as prefix cache fetching and model switching. Although intra-server multipath data transfer between GPU and host memory is theoretically possible, heterogeneous protocols such as PCIe and NVLink currently limit the bandwidth between host memory and GPUs to that of a single PCIe link. This limitation results in underutilized intra-server bandwidth. To address this issue, we propose Multipath Memory Access (MMA), a scheme that, to the best of our knowledge, is the first to enable efficient multipath data transfer between GPU and host memory. MMA supports seamless deployment via dynamic library injection, enabling LLM applications to benefit from MMA without requiring any code modification. In our testbed, MMA significantly improves the data transfer bandwidth between the GPU and host memory, achieving a peak bandwidth of 245 GB/s, a 4.62x speedup over the native single-path bandwidth. End-to-end evaluations demonstrate that MMA reduces the time-to-first-token (TTFT) for LLM serving by 1.14x to 2.38x and decreases model-switching latency in vLLM's sleep mode by 1.12x to 2.48x.
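
To make the multipath idea concrete, the following is a minimal sketch, not the paper's algorithm: it splits one host-to-GPU transfer into fixed-size chunks and greedily assigns each chunk to whichever path would finish it earliest. This greedy earliest-finish rule is a simple stand-in for the congestion-aware path selection the paper describes. The `Path` class, the path names, the chunk size, and all bandwidth figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One hypothetical route from host memory to the target GPU."""
    name: str
    bandwidth_gbs: float   # assumed sustained bandwidth in GB/s (illustrative)
    queued_bytes: int = 0  # bytes already scheduled on this path

    def finish_time(self, extra_bytes=0):
        """Estimated seconds until this path drains its queue plus extra_bytes."""
        return (self.queued_bytes + extra_bytes) / (self.bandwidth_gbs * 1e9)


def split_into_chunks(total_bytes, chunk_bytes):
    """Cut a transfer into equal chunks, with a smaller final chunk if needed."""
    full, rest = divmod(total_bytes, chunk_bytes)
    return [chunk_bytes] * full + ([rest] if rest else [])


def assign_chunks(total_bytes, chunk_bytes, paths):
    """Greedy earliest-finish assignment; returns bytes scheduled per path."""
    plan = {p.name: 0 for p in paths}
    for size in split_into_chunks(total_bytes, chunk_bytes):
        # Pick the path that would complete this chunk soonest given its backlog.
        best = min(paths, key=lambda p: p.finish_time(size))
        best.queued_bytes += size
        plan[best.name] += size
    return plan


if __name__ == "__main__":
    # Hypothetical setup: one direct PCIe path plus two relay paths via peer GPUs.
    paths = [
        Path("direct-pcie", 50.0),
        Path("relay-gpu1", 40.0),
        Path("relay-gpu2", 40.0),
    ]
    plan = assign_chunks(total_bytes=8 << 30, chunk_bytes=64 << 20, paths=paths)
    for name, nbytes in plan.items():
        print(f"{name}: {nbytes / (1 << 20):.0f} MiB")
    print(f"estimated makespan: {max(p.finish_time() for p in paths):.4f} s")
```

In this sketch, faster or less-loaded paths naturally receive more chunks, so the aggregate rate approaches the sum of the path bandwidths. For scale, the abstract's figures imply a native single-path bandwidth of roughly 245 / 4.62 ≈ 53 GB/s.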

Paper Structure

This paper contains 25 sections, 1 equation, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Simplified Intra-Server PCIe Topology. Each NUMA node contains two PCIe switches, and eight GPUs are interconnected via NVLink.
  • Figure 2: The proportion of prefix-cache fetching time in TTFT under different hit-token lengths and models
  • Figure 3: The proportion of H2D/D2H transfer time in swap-in and swap-out latency under different models
  • Figure 4: Traffic imbalance in LLM applications
  • Figure 5: Overview of MMA.
  • ...and 11 more figures