AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

TL;DR

AlpaServe reframes model parallelism from a tool for fitting large models to a mechanism for multiplexing multiple models under bursty, latency-sensitive workloads. It jointly optimizes auto-parallelization, placement, and runtime scheduling to maximize SLO attainment, using a simulator-guided greedy approach and group-partitioning strategies. The evaluation on real traces and large-scale clusters shows substantial improvements in throughput, burst tolerance, and latency guarantees with fewer devices compared to replication-based baselines. This work demonstrates that carefully orchestrated model parallelism can dramatically improve serving efficiency for multiple large models in production settings.

Abstract

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10x higher rates or 6x more burstiness while staying within latency constraints for more than 99% of requests.

Paper Structure

This paper contains 24 sections, 5 equations, 17 figures, 2 tables, and 2 algorithms.

Figures (17)

  • Figure 1: Two placement strategies for serving two models on two GPUs. In each subfigure, the left part shows the model placements and the right part shows the timeline for handling bursty requests. At the time of "Burst 1", 4 requests of model A come at the same time. Colocation with model parallelism can reduce the average completion time of bursty requests.
  • Figure 2: Latency CDF and cluster utilization in the 2-model example.
  • Figure 3: Replication and model parallel placement illustration with different memory budgets, where the memory budgets are set to be multiples of a single model's size.
  • Figure 4: Serving performance with changing per-GPU memory budgets. Model parallelism is beneficial when the memory budget is limited. The dashed vertical line is the real per-GPU memory bound of a 16GB V100. The usable value is around 13GB because activations and other runtime context must also be stored.
  • Figure 5: Serving performance with changing arrival rates. Model parallelism is beneficial for smaller rates.
  • ...and 12 more figures
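
The burst scenario in Figure 1 can be illustrated with a small, idealized timeline calculation. The sketch below is not AlpaServe's simulator or placement algorithm. It assumes that every request of model A takes the same fixed time on one GPU, that all burst requests arrive at time 0, that no requests for model B are in flight, and that a two-stage pipeline splits the work evenly across the two GPUs. The function names and the `overhead` parameter (a fractional slowdown standing in for model-parallel overhead) are hypothetical, introduced only for this example.

```python
def simple_placement_completions(num_requests, latency):
    """Completion times when one GPU holds the whole model.

    The burst is served one request after another on that single GPU;
    the other GPU (holding the other model) sits idle.
    """
    return [latency * (i + 1) for i in range(num_requests)]


def pipeline_completions(num_requests, latency, num_stages=2, overhead=0.0):
    """Completion times when the model is split into pipeline stages.

    Each stage lives on its own GPU. A request can enter a stage as soon
    as it has left the previous stage and the stage's GPU is free.
    `overhead` is an assumed fractional slowdown from model parallelism.
    """
    stage_time = latency * (1.0 + overhead) / num_stages
    stage_free_at = [0.0] * num_stages  # when each stage's GPU is next idle
    completions = []
    for _ in range(num_requests):
        t = 0.0  # every request in the burst arrives at time 0
        for s in range(num_stages):
            start = max(t, stage_free_at[s])
            t = start + stage_time
            stage_free_at[s] = t
        completions.append(t)
    return completions


def mean(values):
    return sum(values) / len(values)


# A burst of 4 requests for model A, each taking 1 time unit on one GPU.
simple = simple_placement_completions(4, 1.0)
pipelined = pipeline_completions(4, 1.0)
print("simple placement:", simple, "mean", mean(simple))
print("model parallel:  ", pipelined, "mean", mean(pipelined))
```

Under these assumptions the mean completion time of the burst drops from 2.5 to 1.75 time units, because the otherwise idle second GPU absorbs part of the burst. Raising `overhead` shrinks that gap, which is the trade-off between model-parallel overhead and statistical multiplexing that the abstract describes.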