Serving Compound Inference Systems on Datacenter GPUs

Sriram Devata, Rahul Singh, Sarita Adve

TL;DR

JigsawServe is the first serving framework that jointly optimizes latency, accuracy, and GPU resource cost for compound inference systems. For each task in the system, it adaptively chooses a model variant and allocates resources at fine granularity by spatially partitioning the GPUs.

Abstract

Applications in emerging domains such as XR are being built as compound inference systems, where multiple ML models are composed in the form of a task graph to service each request. Serving these compound systems efficiently raises two questions: how to apportion end-to-end latency and accuracy budgets between different tasks in a compound inference system, and how to allocate resources effectively for different models with varying resource requirements. We present JigsawServe, the first serving framework that jointly optimizes for latency, accuracy, and cost in terms of GPU resources by adaptively choosing model variants and performing fine-grained resource allocation by spatially partitioning the GPUs for each task of a compound inference system. Analytical evaluation of a system with a large number of GPUs shows that JigsawServe can increase the maximum serviceable demand (in requests per second) by 11.3x when compared to the closest prior work. Our empirical evaluation shows that for a large range of scenarios, JigsawServe consumes only 43.3% of the available GPU resources while meeting accuracy SLOs with less than 0.6% latency SLO violations. All of the features in JigsawServe contribute to this high efficiency -- sacrificing any one feature of accuracy scaling, GPU spatial partitioning, or task-graph-informed resource budgeting significantly reduces efficiency.
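
The joint choice described in the abstract can be illustrated with a toy search. The sketch below is not JigsawServe's optimizer: it assumes a two-task linear chain, takes end-to-end latency as the sum of per-task latencies and end-to-end accuracy as the product of per-task accuracies, and replaces whatever solver the paper uses with exhaustive enumeration. The task names, the `PROFILES` table, the `cheapest_config` function, and every number are hypothetical.

```python
from itertools import product

# Hypothetical profile: task -> list of options, each option being
# (variant, accuracy, gpu_fraction, latency_ms). All values are made up.
PROFILES = {
    "detect": [
        ("small", 0.80, 1 / 7, 30.0),
        ("small", 0.80, 2 / 7, 18.0),
        ("large", 0.92, 2 / 7, 40.0),
        ("large", 0.92, 4 / 7, 22.0),
    ],
    "classify": [
        ("small", 0.85, 1 / 7, 12.0),
        ("large", 0.95, 1 / 7, 25.0),
        ("large", 0.95, 2 / 7, 14.0),
    ],
}


def cheapest_config(profiles, latency_slo_ms, accuracy_slo):
    """Return (cost, config) for the cheapest feasible choice, or None."""
    tasks = list(profiles)
    best = None
    # Try every combination of one (variant, GPU segment) option per task.
    for combo in product(*(profiles[t] for t in tasks)):
        latency = sum(opt[3] for opt in combo)  # assumption: tasks run in sequence
        accuracy = 1.0
        for opt in combo:
            accuracy *= opt[1]  # assumption: per-task accuracies multiply
        cost = sum(opt[2] for opt in combo)  # total GPU fraction used
        if latency > latency_slo_ms or accuracy < accuracy_slo:
            continue  # violates the latency or accuracy SLO
        if best is None or cost < best[0]:
            best = (cost, dict(zip(tasks, combo)))
    return best
```

Under these assumptions, tightening the accuracy SLO forces the larger variants, and the latency SLO then decides which task receives the bigger GPU slice. Fixing any one dimension in advance (variant, partition size, or per-task budget) shrinks the search space and can exclude the cheapest feasible configuration.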

Paper Structure (18 sections, 14 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: An overview of JigsawServe.
  • Figure 2: Applications and their inference task graphs used to evaluate JigsawServe. Each task is annotated with the model variants that can perform the task, along with the relevant accuracy metrics.
  • Figure 3: Maximum demand that can be satisfied for the traffic analysis application on a large testbed by exploring combinations of A/S/T (accuracy scaling, GPU spatial partitioning, and task-graph-informed resource budgeting). Higher maximum demand is better.
  • Figure 4: (a), (b), and (c) show the statistics of all demand timestamps for the evaluated applications. We cap the SLO violation rate at 50% for better visualization. (d), (e), (f) show the aggregated statistics for the low demand conditions (timestamp 180-240), high demand conditions (timestamp 100-150), and the average over all demand timestamps. The bars are hatched if the SLO violation rate is $\geq$10%. For all metrics, lower is better.
  • Figure 5: The frequency of model variants and GPU segment types in the configurations chosen by JigsawServe for the evaluated applications. The GPU segment types are described by both a MIG instance type (1/7 - 7/7) and an MPS concurrency level (1, 2, 3, 4).