IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, Pooyan Jamshidi

TL;DR

IPA addresses the challenge of jointly optimizing end-to-end accuracy, latency, and cost in multi-stage inference pipelines under SLA constraints. It couples offline profiling of per-variant latency and accuracy with an online IP-based optimizer and a predictive LSTM forecaster to adapt batch size, replication, and variant selection, maximizing Pipeline Accuracy Score ($PAS$) while minimizing resource use. On Kubernetes with five real-world pipelines, it delivers up to 21% improvement in $PAS$ with minimal cost increase. It advances a production-ready, tunable framework that unifies variant switching, autoscaling, and batching for multi-stage inference, and outlines future directions for scalability and GPU-sharing.

Abstract

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider only one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase. The code and data for replication are available at https://github.com/reconfigurable-ml-pipeline/ipa.
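
To make the configuration problem in the abstract concrete, the following is a minimal sketch of choosing a model variant, batch size, and replica count per stage under a latency SLA. It is not IPA's implementation: it enumerates all combinations by brute force instead of using an Integer Programming solver, and the profile numbers, the batching latency model, the additive accuracy score, the `alpha`/`beta` weights, and every function name are assumptions made for illustration. Queueing delay is ignored.

```python
from itertools import product

# Hypothetical offline profiles, one dict per pipeline stage:
# variant name -> (accuracy, latency in ms at batch size 1, CPU cores per replica).
# All numbers are invented for illustration.
PROFILES = [
    {"small": (0.70, 20.0, 1), "large": (0.80, 50.0, 2)},  # stage 0
    {"small": (0.60, 10.0, 1), "large": (0.75, 40.0, 2)},  # stage 1
]
BATCH_SIZES = (1, 2, 4)
MAX_REPLICAS = 4


def batch_latency_ms(base_ms, batch):
    """Assumed latency model: batching amortizes half of the per-request work."""
    return base_ms * (0.5 + 0.5 * batch)


def stage_options(stage, arrival_rps):
    """Yield (variant, batch, replicas, accuracy, latency_ms, cores) tuples
    whose aggregate throughput covers the arrival rate."""
    for name, (acc, base_ms, cores) in PROFILES[stage].items():
        for batch in BATCH_SIZES:
            lat = batch_latency_ms(base_ms, batch)
            per_replica_rps = batch / (lat / 1000.0)
            for replicas in range(1, MAX_REPLICAS + 1):
                if replicas * per_replica_rps >= arrival_rps:
                    yield (name, batch, replicas, acc, lat, replicas * cores)
                    break  # extra replicas would only add cost in this model


def choose_config(arrival_rps, sla_ms, alpha=1.0, beta=0.05):
    """Return the per-stage options maximizing alpha * accuracy - beta * cost
    subject to the end-to-end latency SLA, or None if nothing is feasible."""
    per_stage = [list(stage_options(s, arrival_rps)) for s in range(len(PROFILES))]
    best, best_score = None, float("-inf")
    for combo in product(*per_stage):
        if sum(o[4] for o in combo) > sla_ms:
            continue  # violates the latency SLA
        accuracy = sum(o[3] for o in combo)  # assumed additive accuracy score
        cost = sum(o[5] for o in combo)      # total CPU cores
        score = alpha * accuracy - beta * cost
        if score > best_score:
            best, best_score = combo, score
    return best


if __name__ == "__main__":
    for sla in (100, 50):
        config = choose_config(arrival_rps=15, sla_ms=sla)
        print(sla, [(o[0], o[1], o[2]) for o in config])
```

With these toy profiles, loosening the SLA lets the search pick the more accurate variants, while a tight SLA forces the lighter ones. Raising `beta` relative to `alpha` shifts the choice toward cheaper configurations, which mirrors the tunable accuracy/cost trade-off the abstract describes.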

Paper Structure (23 sections, 11 equations, 18 figures, 15 tables)

Figures (18)

  • Figure 1: IPA provides a tunable framework for adjusting the system between two conflicting objectives, cost and accuracy.
  • Figure 2: Performance difference across ResNet Family models for a batch size of one and one CPU core allocation.
  • Figure 3: Impact of configuration knobs. Batching indirectly affects cost: for example, a decrease in throughput pushes IPA to scale out further, which increases cost.
  • Figure 4: IPA system design. It consists of an offline phase for model profiling and an online phase for adaptive inference serving.
  • Figure 5: Switching between different configurations under (a) low and (b) high loads.
  • ...and 13 more figures