
Characterizing Network Requirements for GPU API Remoting in AI Applications

Tianxia Wang, Zhuofu Chen, Xingda Wei, Jinyu Gu, Rong Chen, Haibo Chen

TL;DR

This work tackles the problem of sizing network resources for GPU API remoting in AI workloads, aiming to keep remoting overhead within a budget $\varepsilon$. It introduces a GPU-centric design and a formal cost model, underpinned by two optimization principles: asynchronous outstanding requests (OR) and shadow resources (SR), which together convert many synchronous APIs into asynchronous ones and overlap CPU and GPU execution. The authors validate their approach through emulation and on real RDMA-enabled hardware, deriving network requirements from the constraint $\mathrm{Cost}(\mathrm{APP}) \le \varepsilon$ and demonstrating that a latency of $5$–$20\,\mu$s with hundreds of Gbps of bandwidth suffices for many models, with overhead often below 5% and some workloads even running faster than without remoting. They also provide an open-source remoting system and analytical tools, offering practical guidance for data-center network provisioning and enabling efficient AI remoting on commodity networks.
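The two principles named above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Proxy` and `Client` classes, the integer shadow ids, and the in-process queue standing in for the network are all assumptions made for illustration. The client hands back a locally generated shadow id right away instead of waiting for the proxy's reply, queues the request as outstanding, and lets the proxy resolve the shadow id to the real handle when a dependent call arrives.

```python
from collections import deque
from itertools import count


class Proxy:
    """Hypothetical stand-in for the GPU-side proxy that owns real handles."""

    def __init__(self):
        self.real = {}      # shadow id -> real descriptor (illustrative)
        self.executed = []  # order in which requests were executed

    def execute(self, op, shadow_id):
        if op == "create":
            # Bind the client's shadow id to a real resource on the GPU side.
            self.real[shadow_id] = f"real-desc-{shadow_id}"
        elif op == "use":
            # A dependent call resolves the shadow id to the real handle.
            assert shadow_id in self.real, "create must run before use"
        self.executed.append((op, shadow_id))


class Client:
    """Hypothetical application-side stub; a deque models the network."""

    def __init__(self, proxy):
        self.proxy = proxy
        self.outstanding = deque()  # requests issued but not yet executed
        self._ids = count(1)

    def create_tensor_descriptor(self):
        # Shadow resource: return a local id without waiting for a reply.
        shadow_id = next(self._ids)
        self.outstanding.append(("create", shadow_id))
        return shadow_id

    def convolution_forward(self, shadow_id):
        # Outstanding request: enqueue and continue on the CPU.
        self.outstanding.append(("use", shadow_id))

    def synchronize(self):
        # In-order delivery guarantees "create" reaches the proxy first.
        while self.outstanding:
            self.proxy.execute(*self.outstanding.popleft())


proxy = Proxy()
client = Client(proxy)
desc = client.create_tensor_descriptor()  # returns without a round trip
client.convolution_forward(desc)          # depends on the shadow id
client.synchronize()                      # proxy now executes both in order
```

The sketch relies on in-order delivery so that the creation request always reaches the proxy before any call that uses the shadow id; a real system would need the transport to provide the same guarantee.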

Abstract

GPU remoting is a promising technique for supporting AI applications. Networking plays a key role in enabling remoting. However, the latency and bandwidth that the network must provide for efficient remoting are unknown. In this paper, we take a GPU-centric approach to derive the minimum latency and bandwidth requirements for GPU remoting, while ensuring no (or little) performance degradation for AI applications. Our study, which includes a theoretical model, demonstrates that, with careful remoting design, unmodified AI applications can run on a remoting setup over commodity networking hardware with low network demands, without any overhead or even with better performance.

Paper Structure (13 sections, 3 equations, 11 figures, 6 tables)


Figures (11)

  • Figure 1: An overview of executing AI applications with (a) and without (b) GPU remoting.
  • Figure 2: A re-plot of the CPU vs. GPU allocation configurations for applications in two production clusters [arXiv:2310.04648].
  • Figure 3: Remoting overhead and optimization effects on A100 with SHM (upper) and RDMA (lower). We break down the remoting time as follows. API: the execution time of the API without remoting. S+D: serialization and deserialization of the arguments to the network buffer. Send: posting the arguments to the network card. Recv: the time spent waiting for the proxy's reply.
  • Figure 4: An illustration of the optimization space (b)–(d) for GPU API remoting, compared with the unoptimized baseline (a).
  • Figure 5: An illustration of how a shadow resource enables a synchronous API (CreateTensorDescriptor, ➀) to become asynchronous, while remaining compatible with APIs that depend on it (e.g., ConvolutionForward, ➁).
  • ...and 6 more figures