DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Yongtong Wu; Shaoyuan Chen; Yinmin Zhong; Rilin Huang; Yixuan Tan; Wentao Zhang; Liyue Zhang; Shangyan Zhou; Yuxuan Liu; Shunfeng Zhou; Mingxing Zhang; Xin Jin; Panpan Huang

TL;DR

DualPath is an inference system that breaks the KV-Cache storage bandwidth bottleneck in agentic LLM inference by introducing dual-path KV-Cache loading. In addition to the traditional storage-to-prefill path, it enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network.

Abstract

The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and avoids interference with latency-critical model execution communications -- with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87$\times$ on our in-house inference system. It can also improve online serving throughput by an average factor of 1.96$\times$ without violating SLO.
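
The bandwidth asymmetry described in the abstract can be illustrated with a small sketch. It is not the paper's scheduler: the `Engine` class, the `snic_gbps` and `pending_gb` fields, the greedy earliest-finish rule, and all numbers are hypothetical, and the extra RDMA hop from decoding engines to prefill engines is assumed to be free. The sketch only shows that letting decoding engines' storage NICs participate raises the aggregate load bandwidth, and how a scheduler might spread KV-Cache loads across both paths.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Hypothetical engine model; field names are illustrative, not from the paper."""
    name: str
    role: str                 # "prefill" or "decode"
    snic_gbps: float          # assumed storage-NIC bandwidth in Gbit/s
    pending_gb: float = 0.0   # KV-Cache gigabytes already queued on this storage NIC

    def eta_seconds(self, extra_gb: float) -> float:
        # Time to drain the queue plus the new load (GB -> Gbit, then divide by Gbit/s).
        return (self.pending_gb + extra_gb) * 8 / self.snic_gbps


def aggregate_load_bandwidth(engines, dual_path: bool) -> float:
    """Sum of storage-NIC bandwidth usable for KV-Cache loading.

    Without dual-path loading only prefill engines read from storage;
    with it, decoding engines contribute their otherwise idle storage NICs.
    """
    return sum(e.snic_gbps for e in engines if dual_path or e.role == "prefill")


def pick_loader(engines, kv_gb: float, dual_path: bool = True) -> Engine:
    """Greedy stand-in for a load-balancing scheduler (an assumption, not the
    paper's algorithm): send the load to the engine that would finish it earliest."""
    candidates = [e for e in engines if dual_path or e.role == "prefill"]
    best = min(candidates, key=lambda e: e.eta_seconds(kv_gb))
    best.pending_gb += kv_gb
    return best


if __name__ == "__main__":
    cluster = [
        Engine("p0", "prefill", 400.0),
        Engine("p1", "prefill", 400.0),
        Engine("d0", "decode", 400.0),
        Engine("d1", "decode", 400.0),
    ]
    print("prefill-only Gbit/s:", aggregate_load_bandwidth(cluster, dual_path=False))
    print("dual-path Gbit/s:  ", aggregate_load_bandwidth(cluster, dual_path=True))
    for _ in range(4):
        print("load assigned to", pick_loader(cluster, kv_gb=10.0).name)
```

In this toy setup with equal numbers of prefill and decoding engines, the usable storage bandwidth doubles under dual-path loading. A real deployment would also have to account for the cost and isolation of the decode-to-prefill transfer, which the abstract says DualPath handles over the compute network.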

Paper Structure (34 sections, 9 equations, 15 figures, 3 tables, 1 algorithm)

Introduction
Background
LLM Inference Preliminary
Agentic Use of LLMs
Modern AI Data Center Architecture
Bottleneck & Motivation
DualPath System Overview
Dual-Path Loading
Bottleneck-Free Analysis
Practical Challenges
CNIC-Centric Traffic Manager
Traffic Isolation
CNIC-Assisted KV-Cache Copy
Adaptive Request Scheduler
Inter-Engine Scheduling
...and 19 more sections

Figures (15)

Figure 1: Existing bottleneck (left) and DualPath (right).
Figure 2: Agent trajectory example.
Figure 3: Left: Hardware trends of NVIDIA GPUs. Right: Relative token throughput with varying request batch size (each request has 30K context with 300 tokens appended).
Figure 4: Dual-path loading illustration. The scheduler dynamically distributes data traffic between the two paths.
Figure 5: An illustration of Inter-Engine PE Scheduling. All eight GPUs are in the same PE engine group and the scheduler will choose the best.
...and 10 more figures
