SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference

Zongshun Zhang, Ibrahim Matta

TL;DR

SERFLOW addresses the cost-efficient execution of multi-stage ML inference pipelines by distributing partitions across IaaS and FaaS in a way that respects strict SLOs. It introduces per-stage optimization via the Sparsity Cost Indifference Point ($S\text{-}CIP$) and the Traffic Cost Indifference Point ($T\text{-}CIP$), enabling offline configuration profiling followed by online scaling and load balancing that adapt to exit distributions and traffic variability. The framework demonstrates substantial cost reductions, outperforming baselines like LIBRA and FaaS-only configurations while maintaining latency guarantees, especially when model sparsity yields varied exit patterns. Practically, SERFLOW can be integrated into MLaaS platforms to enable economical, dynamic inference across hybrid cloud resources, with potential extensions to other compute platforms and large-scale models.

Abstract

Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing the cost of adaptive inference applications. However, prior work often overlooks real-world factors such as Virtual Machine (VM) cold starts and requests with long-tail service time distributions. To address these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, where each stage is a contiguous block of sparse model parameters ending in an internal or final classifier at which requests may exit. Since exit rates are input-dependent, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling them to handle bursts of requests that reach deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over $23\%$ while efficiently adapting to dynamic workloads.
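
The stage model in the abstract implies that the load on each stage shrinks as requests exit at earlier classifiers. The sketch below is a minimal illustration, not the authors' implementation. The function name `stage_arrival_rates` is hypothetical, and it assumes each exit fraction is measured relative to the requests arriving at that stage (the source does not say whether the fraction is relative to per-stage or total traffic).

```python
def stage_arrival_rates(source_rate, exit_fractions):
    """Return the request rate (req/s) entering each stage.

    Assumption: exit_fractions[i] is the fraction of requests arriving at
    stage i that exit at that stage's classifier. The remainder continues
    to stage i + 1.
    """
    rates = []
    rate = source_rate
    for beta in exit_fractions:
        if not 0.0 <= beta <= 1.0:
            raise ValueError("exit fractions must lie in [0, 1]")
        rates.append(rate)      # traffic that this stage must serve
        rate *= (1.0 - beta)    # traffic that survives to the next stage
    return rates


if __name__ == "__main__":
    # Three stages; the final classifier always exits (fraction 1.0).
    print(stage_arrival_rates(100.0, [0.5, 0.5, 1.0]))
```

Under this assumption, deeper stages see only a fraction of the source traffic, which is why a single resource configuration sized for the first stage would leave later stages underutilized.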

Paper Structure

This paper contains 17 sections, 11 equations, 15 figures, 1 table, 4 algorithms.

Figures (15)

  • Figure 1: Single-Stage: All requests from the source ($N_{0}$) go through the whole model; Multi-Stage with Internal Classifiers: $\beta_{pid}$ of requests exit at NN partition $F_{pid}$ and the remaining $N_{pid}$ requests per second are fed into partition $F_{pid+1}$.
  • Figure 2: When $conf\_thres \leq 0.8$, Hybrid Offloading is cheaper than IaaS-only, given $r_{max}=100/6$.
  • Figure 3: When $conf\_thres \geq 0.55$, Hybrid Offloading is cheaper than FaaS-only, given $r_{max}=100/6$.
  • Figure 4: Overview of SERFLOW: Offline, we replay steady traffic with long-term average $N$ req/s from historical traces to profile three candidate setups (VM-only, Hybrid Offloading, FaaS-only). Step 1: the SLO-Aware Configurator keeps only SLO-feasible configs. Step 2: given the observed early-exit rate $\beta$, the $\beta$-Aware Configurator picks the lowest-cost setup from the three candidates. Online, the Scaling Manager maintains a group of low-cost instances derived from the selected setup, using the EWMA ($\mu_{t}$) and deviation ($\sigma_{t}$) of the online arrival rate $\lambda_{t}$ (req/s). The load balancer then handles spikes at time $t{+}1$, $\lambda_{t+1}$, by first fully utilizing the available low-cost instances provisioned at time $t$ and sending any remaining traffic to FaaS. When the $\beta$ distribution over early exits drifts, the system triggers targeted re-profiling and re-selection via the $\beta$-Aware Configurator.
  • Figure 5: The SLO-aware Configurator searches for the SLO-compliant configurations given a per-instance ingestion rate $r_{max}$ requests per second for VMs, Hybrid Offloading and Serverless.
  • ...and 10 more figures
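
The online path described in the Figure 4 caption (track the EWMA $\mu_{t}$ and deviation $\sigma_{t}$ of the arrival rate, keep a group of low-cost instances, and send overflow to FaaS) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the update rule (a TCP-style mean and mean-deviation estimator), the sizing rule $\lceil(\mu_{t} + k\,\sigma_{t})/r_{max}\rceil$, the weights `alpha` and `k`, and all names are hypothetical. Only $\mu_{t}$, $\sigma_{t}$, $\lambda_{t}$, $r_{max}$, and the "fill instances first, spill the rest to FaaS" policy come from the captions.

```python
import math
from dataclasses import dataclass


@dataclass
class EwmaTracker:
    """Tracks an EWMA (mu) and a smoothed absolute deviation (sigma)."""
    alpha: float = 0.25        # assumed smoothing weight
    mean: float = 0.0
    dev: float = 0.0
    initialized: bool = False

    def update(self, rate):
        """Fold a new arrival-rate sample (req/s) into mu and sigma."""
        if not self.initialized:
            self.mean, self.dev, self.initialized = rate, 0.0, True
        else:
            error = abs(rate - self.mean)  # deviation from the previous mean
            self.dev = (1 - self.alpha) * self.dev + self.alpha * error
            self.mean = (1 - self.alpha) * self.mean + self.alpha * rate
        return self.mean, self.dev


def provisioned_instances(mean, dev, r_max, k=1.0):
    """Assumed sizing rule: enough instances to absorb mean + k * dev."""
    return math.ceil(max(mean + k * dev, 0.0) / r_max)


def split_traffic(rate, instances, r_max):
    """Fill provisioned instances first; the remainder spills to FaaS."""
    capacity = instances * r_max
    to_instances = min(rate, capacity)
    return to_instances, rate - to_instances


if __name__ == "__main__":
    tracker = EwmaTracker(alpha=0.5)
    tracker.update(10.0)
    mu, sigma = tracker.update(20.0)
    n = provisioned_instances(mu, sigma, r_max=10.0)
    print(n, split_traffic(35.0, n, r_max=10.0))
```

Sizing on $\mu_{t} + k\,\sigma_{t}$ rather than on $\mu_{t}$ alone is one way to trade a small amount of idle capacity for fewer requests spilling to FaaS. The value of `k` here is a placeholder, not a value taken from the paper.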