CascadeServe: Unlocking Model Cascades for Inference Serving

Ferdi Kossmann; Ziniu Wu; Alex Turk; Nesime Tatbul; Lei Cao; Samuel Madden

TL;DR

CascadeServe automates and optimizes end-to-end inference serving with model cascades. Compared to state-of-the-art baselines on different workloads, it saves 2-3x in cost across a wide spectrum of the latency-accuracy space.

Abstract

Machine learning (ML) models are increasingly deployed to production, calling for efficient inference serving systems. Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications have frequent, high, and sudden variations, which make it hard to provision hardware correctly. Model cascades are positioned to tackle both of these challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing for fine-grained adjustments to request arrival rates. Despite their potential, model cascades have not been used inside an online serving system. Doing so comes with its own set of challenges, including workload adaption, model replication onto hardware, inference scheduling, request batching, and more. In this work, we propose CascadeServe, which automates and optimizes end-to-end inference serving with cascades. CascadeServe operates in two phases, offline and online. In the offline phase, the system pre-computes a gear plan that specifies how to serve inferences online. In the online phase, the gear plan allows the system to serve inferences while making near-optimal adaptations to the query load with negligible decision overhead. We find that CascadeServe saves 2-3x in cost across a wide spectrum of the latency-accuracy space when compared to state-of-the-art baselines on different workloads.
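
To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the CascadeServe implementation. It assumes a cascade that runs models cheapest-first and stops once a model's confidence clears a threshold, and it assumes a gear plan can be approximated as a precomputed table mapping query-load ranges to cascade thresholds. The names `cascade_predict`, `pick_gear`, `small_model`, `large_model`, `BOUNDARIES`, and `GEARS`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any callable returning (label, confidence in [0, 1]).
Model = Callable[[str], Tuple[int, float]]


def cascade_predict(
    x: str, models: Sequence[Model], thresholds: Sequence[float]
) -> Tuple[int, int]:
    """Run models cheapest-first; stop at the first sufficiently confident one.

    Returns (label, index of the model that produced the answer).
    """
    if not models:
        raise ValueError("models must be non-empty")
    if len(thresholds) != len(models) - 1:
        raise ValueError("need one threshold per non-final model")
    last = len(models) - 1
    for i, model in enumerate(models):
        label, conf = model(x)
        # The final model always answers; earlier ones only if confident.
        if i == last or conf >= thresholds[i]:
            return label, i
    raise AssertionError("unreachable")


def pick_gear(
    qps: float, boundaries: Sequence[float], gears: Sequence[List[float]]
) -> List[float]:
    """Look up the precomputed thresholds for the current query load.

    boundaries must be sorted; gears has len(boundaries) + 1 entries.
    """
    return gears[bisect_right(boundaries, qps)]


# Toy stand-ins for a cheap and an expensive classifier (hypothetical).
def small_model(text: str) -> Tuple[int, float]:
    return (1, 0.95) if len(text) < 10 else (1, 0.55)


def large_model(text: str) -> Tuple[int, float]:
    return (0, 0.99)


# Hypothetical table: higher load -> lower threshold -> fewer escalations.
BOUNDARIES = [100.0, 500.0]       # queries per second
GEARS = [[0.95], [0.8], [0.5]]    # thresholds for the small model

if __name__ == "__main__":
    models = [small_model, large_model]
    for qps in (50.0, 1000.0):
        thresholds = pick_gear(qps, BOUNDARIES, GEARS)
        print(qps, cascade_predict("a much longer input", models, thresholds))
```

In this sketch the online decision is a single table lookup, which is one way to read the abstract's claim of negligible decision overhead; the actual gear plan may also cover model placement, replication, and batching, which are not modeled here.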

Paper Structure (58 sections, 5 equations, 25 figures, 2 algorithms)

Introduction
Background
Model cascades
Related work
Inference serving.
System overview
Gear plan generation
Subproblem co-optimization
Iteration logic.
Cascade search
Workload adaption
Hardware mapping
Dynamic batching
Gear plan operation
Inference server.
...and 43 more sections

Figures (25)

Figure 1: Per-sample processing times and accuracy of fine-tuned BERT models on the Sentiment-140 benchmark (sentiment140).
Figure 2: Processing times of BERT cascades when changing model placement and batch sizes.
Figure 3: Overview of CascadeServe.
Figure 4: Online serving architecture.
Figure 5: End-to-end performance on the BERT workload.
...and 20 more figures
