Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling

Kamran Razavi, Saeid Ghafouri, Max Mühlhäuser, Pooyan Jamshidi, Lin Wang

TL;DR

This work tackles end-to-end SLO variability for deep learning inference serving mobile/IoT clients over volatile wireless networks. It introduces Sponge, a system combining in-place vertical scaling, dynamic batching, and request reordering, guided by an Integer Programming formulation that links latency $l(b,c)$ and throughput $h(b,c)$ to the number of CPU cores $c$ and the batch size $b$. A prototype reduces SLO violations (often by >15× vs. horizontal autoscalers) and lowers resource use, even under dynamic bandwidth, by leveraging a performance model with $l(b,c)$ and $L(b,c)$ and an EDF queue to guarantee per-request SLOs. The approach highlights practical gains for resource-efficient DL inference in mobile-edge settings and outlines future work to support model pipelines, multiple model variants, and joint horizontal scaling.

Abstract

Mobile and IoT applications increasingly adopt deep learning inference to provide intelligence. Inference requests are typically sent to a cloud infrastructure over a wireless network that is highly variable, leading to the challenge of dynamic Service Level Objectives (SLOs) at the request level. This paper presents Sponge, a novel deep learning inference serving system that maximizes resource efficiency while guaranteeing dynamic SLOs. Sponge achieves its goal by applying in-place vertical scaling, dynamic batching, and request reordering. Specifically, we introduce an Integer Programming formulation to capture the resource allocation problem, providing a mathematical model of the relationship between latency, batch size, and resources. We demonstrate the potential of Sponge through a prototype implementation and preliminary experiments, and discuss future work.
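To make the resource allocation problem concrete, the following is a minimal sketch that searches for the smallest core count $c$ and a batch size $b$ satisfying a latency budget and an arrival rate. It is not the paper's Integer Programming formulation: the exhaustive search, the objective (fewest cores first, then smallest batch), the coefficients in `latency_ms`, and the names `throughput_rps` and `choose_allocation` are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def latency_ms(b: int, c: int) -> float:
    """Hypothetical latency model l(b, c) in milliseconds.

    Assumption: work grows linearly with batch size b and is divided
    across c cores, plus a fixed overhead. The coefficients are made up
    and are not the fitted values from the paper.
    """
    return (40.0 * b + 20.0) / c + 5.0


def throughput_rps(b: int, c: int) -> float:
    """Throughput h(b, c) in requests per second: one batch per l(b, c)."""
    return b / (latency_ms(b, c) / 1000.0)


def choose_allocation(
    budget_ms: float,
    arrival_rps: float,
    max_cores: int = 16,
    max_batch: int = 16,
) -> Optional[Tuple[int, int]]:
    """Return (cores, batch) using the fewest cores that meets both constraints.

    Constraints checked for each candidate:
      * l(b, c) <= budget_ms       (the batch finishes within the budget)
      * h(b, c) >= arrival_rps     (the server keeps up with the load)
    Returns None when no candidate within the limits is feasible.
    """
    for c in range(1, max_cores + 1):
        for b in range(1, max_batch + 1):
            if latency_ms(b, c) <= budget_ms and throughput_rps(b, c) >= arrival_rps:
                return c, b
    return None


if __name__ == "__main__":
    # 100 ms processing budget, 50 requests per second arriving.
    print(choose_allocation(budget_ms=100.0, arrival_rps=50.0))
```

With this toy model, one or two cores cannot sustain 50 requests per second inside a 100 ms budget, so the search settles on three cores. A tighter budget (for example after a slow upload) would push the answer toward more cores, which is the situation in-place vertical scaling is meant to handle.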

Paper Structure (12 sections, 3 equations, 4 figures, 2 tables, 1 algorithm)

Figures (4)

  • Figure 1: Bandwidth measurements in 4G networks from van der Hooft et al. (2016). The bandwidth varies from 0.5 MB/s to 7 MB/s over a 10-minute window (top plot). The bottom plot shows the SLO budget remaining for processing when the user sends a 100 KB, 200 KB, or 500 KB image over the same network.
  • Figure 2: An overview of the Sponge architecture. The monitoring service collects metric data from the DL model. The queue prioritizes requests according to the EDF policy. The scaler is responsible for determining vertical scaling and batch size decisions for the DL model and adjusting the system accordingly.
  • Figure 3: Measured and predicted latency across CPU core allocations and batch sizes for the YOLOv5n and ResNet18 DL models.
  • Figure 4: SLO violations and allocated CPU cores.
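
The ideas behind Figures 1 and 2 can be sketched in a few lines: the time spent uploading a payload is subtracted from the end-to-end SLO to obtain the remaining processing budget, and requests are then served in earliest-deadline-first (EDF) order. This is an illustrative sketch only. The function `remaining_slo_ms`, the class `EDFQueue`, the 600 ms SLO, and the convention 1 MB = 1000 KB are assumptions, not details taken from the paper or its prototype.

```python
import heapq
import itertools
from typing import List, Tuple


def remaining_slo_ms(slo_ms: float, payload_kb: float, bandwidth_mb_s: float) -> float:
    """Processing budget left after the upload, in milliseconds.

    Assumes 1 MB = 1000 KB, so KB / (MB/s) is numerically the transfer
    time in milliseconds. Ignores propagation delay and the response path.
    """
    transfer_ms = payload_kb / bandwidth_mb_s
    return slo_ms - transfer_ms


class EDFQueue:
    """Minimal earliest-deadline-first queue built on a binary heap."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        # Tie-breaker so requests with equal deadlines keep arrival order.
        self._counter = itertools.count()

    def push(self, request_id: str, deadline_ms: float) -> None:
        heapq.heappush(self._heap, (deadline_ms, next(self._counter), request_id))

    def pop_batch(self, batch_size: int) -> List[str]:
        """Remove and return up to batch_size requests, most urgent first."""
        batch: List[str] = []
        while self._heap and len(batch) < batch_size:
            _, _, request_id = heapq.heappop(self._heap)
            batch.append(request_id)
        return batch


if __name__ == "__main__":
    queue = EDFQueue()
    # Three images sent at the same moment over a 1 MB/s link, 600 ms SLO.
    for request_id, size_kb in [("a", 100.0), ("b", 200.0), ("c", 500.0)]:
        queue.push(request_id, remaining_slo_ms(600.0, size_kb, 1.0))
    print(queue.pop_batch(2))
```

Here the 500 KB image arrives with the least budget left, so it is dequeued first even though it was enqueued last. This is the reordering effect that an EDF policy provides when remaining budgets differ per request.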