FaaSTube: Optimizing GPU-oriented Data Transfer for Serverless Computing

Hao Wu, Junxiao Deng, Minchen Yu, Yue Yu, Yaochen Liu, Hao Fan, Song Wu, Wei Wang

TL;DR

FaaSTube is a GPU-efficient data-passing system for serverless inference. It enables fine-grained bandwidth sharing over PCIe and NVLink, minimizing data-passing latency for both host-to-GPU and GPU-to-GPU transfers while providing performance isolation between functions.

Abstract

Serverless computing has gained significant traction for machine learning inference applications, which are often deployed as serverless workflows consisting of multiple CPU and GPU functions with data dependencies. However, existing data-passing solutions for serverless computing primarily rely on host memory for fast data transfer, mandating substantial data movement and resulting in significant I/O overhead. In this paper, we present FaaSTube, a GPU-efficient data-passing system for serverless inference. FaaSTube manages intermediate data within a GPU memory pool to facilitate direct data exchange between GPU functions. It enables fine-grained bandwidth sharing over PCIe and NVLink, minimizing data-passing latency for both host-to-GPU and GPU-to-GPU transfers while providing performance isolation between functions. Additionally, FaaSTube implements an elastic GPU memory pool that dynamically scales to accommodate varying data-passing demands. Evaluations on real-world applications show that FaaSTube reduces end-to-end latency by up to 90% and achieves up to 12x higher throughput compared to the state of the art.

Paper Structure

This paper contains 28 sections, 17 figures, 1 table, 1 algorithm.

Figures (17)

  • Figure 1: A typical traffic analysis application [Adainf, Boggard].
  • Figure 2: Comparison of host-oriented inter-function data passing and our GPU-oriented inter-function data passing.
  • Figure 3: Performance analysis of real-world inference workflows on INFless+. (a) Breakdown of overall latency. (b) Breakdown of latency for the Traffic workflow with various batch sizes. Each bar is broken into three parts: the latencies of host-to-gFunc data passing (top), gFunc-to-gFunc data passing (middle), and computation (bottom).
  • Figure 4: Various connection topologies in GPU servers: (a) PCIe connections between host and GPUs, (b) 8 GPUs connected via hard-wired NVLinks, as in DGX V100, and (c) 8 GPUs connected via switch-based NVLinks, as in DGX A100.
  • Figure 5: (a) Comparison of host-to-gFunc passing overhead between separate and concurrent execution of the video and driving workflows. (b) The impact of pinned memory on PCIe transfers.
  • ...and 12 more figures