SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference
Zongshun Zhang, Ibrahim Matta
TL;DR
SERFLOW addresses the cost-efficient execution of multi-stage ML inference pipelines by distributing partitions across IaaS and FaaS in a way that respects strict SLOs. It introduces per-stage optimization via the Sparsity Cost Indifference Point ($S\text{-}CIP$) and the Traffic Cost Indifference Point ($T\text{-}CIP$), enabling offline configuration profiling followed by online scaling and load balancing that adapt to exit distributions and traffic variability. The framework demonstrates substantial cost reductions, outperforming baselines like LIBRA and FaaS-only configurations while maintaining latency guarantees, especially when model sparsity yields varied exit patterns. Practically, SERFLOW can be integrated into MLaaS platforms to enable economical, dynamic inference across hybrid cloud resources, with potential extensions to other compute platforms and large-scale models.
Abstract
Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing costs of adaptive inference applications. However, prior work often overlooks real-world factors such as Virtual Machine (VM) cold starts and requests with long-tail service time distributions. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling them to handle bursts of requests that reach deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over $23\%$ while efficiently adapting to dynamic workloads.
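To illustrate why per-stage exit fractions matter for provisioning, the following minimal sketch computes the request rate that reaches each stage of an early-exit pipeline. It is not SERFLOW's implementation: the function name `stage_arrival_rates`, the parameter names, and the convention that each exit fraction is expressed relative to all ingress requests are assumptions made for this example.

```python
import math
from typing import List


def stage_arrival_rates(ingress_rps: float, exit_fractions: List[float]) -> List[float]:
    """Return the request rate (requests/second) reaching each stage.

    Assumption for this sketch: exit_fractions[i] is the fraction of ALL
    ingress requests that exit at stage i, so the fractions sum to 1
    (every request eventually exits at some classifier).
    """
    if not math.isclose(sum(exit_fractions), 1.0, abs_tol=1e-9):
        raise ValueError("exit fractions must sum to 1")

    rates = []
    remaining = 1.0  # fraction of ingress traffic still in the pipeline
    for frac in exit_fractions:
        # Every request still in the pipeline must be served by this stage.
        rates.append(ingress_rps * remaining)
        # Requests exiting here never reach the later stages.
        remaining -= frac
    return rates


if __name__ == "__main__":
    # Hypothetical workload: 100 requests/s, half of which exit at stage 0.
    for stage, rate in enumerate(stage_arrival_rates(100.0, [0.5, 0.3, 0.2])):
        print(f"stage {stage}: {rate:.1f} req/s")
```

With the hypothetical exit fractions above, the three stages see roughly 100, 50, and 20 requests per second, so resources sized for the first stage would sit largely idle if replicated at the last one. This is the sense in which no single resource configuration suits all query distributions.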
