Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines

Chaokun Chang; Eric Lo; Chunxiao Ye

TL;DR

Biathlon targets real-time ML inference pipelines in which expensive online aggregations for feature preparation make latency requirements hard to meet. It presents an online, plan-based system that combines Approximate Query Processing, uncertainty propagation via quasi-Monte Carlo, and Sobol-based feature importance to determine per-feature approximation levels, ensuring a probabilistic accuracy bound defined by $Pr(|Y-\hat{y}|\le\delta)\ge\tau$. The approach yields substantial latency reductions (approximately 5.3× to 16.6×) on seven real pipelines with minimal accuracy loss, and the guarantee remains tunable via $\tau$ and $\delta$, with stricter settings costing additional computation. This work demonstrates a practical pathway to accelerate ML inference pipelines by exploiting model resilience and online feature approximation, with potential synergy with feature-store techniques in handling stale or heavy features.
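As a rough illustration of the bound $Pr(|Y-\hat{y}|\le\delta)\ge\tau$, the sketch below handles a single aggregation feature (a mean) and grows the sample until the bound is estimated to hold. It is not Biathlon's implementation. The function names (`van_der_corput`, `approx_mean`, `bound_probability`, `serve`), the normal approximation of the feature error, the doubling schedule, and the fresh resampling at each step are all assumptions made for this example, and a one-dimensional van der Corput sequence stands in for the Sobol sequences a multi-feature pipeline would likely use. Here $\hat{y}$ is taken to be the prediction on the approximate feature, and $Y$ the prediction on a feature value drawn from the estimated error distribution.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def van_der_corput(i, base=2):
    """i-th point (i >= 1) of the van der Corput low-discrepancy sequence."""
    q, denom = 0.0, 1.0
    while i > 0:
        i, r = divmod(i, base)
        denom *= base
        q += r / denom
    return q


def approx_mean(values, n, rng):
    """Estimate the mean from n rows sampled without replacement.

    Returns (estimate, standard error). The finite population correction
    makes the standard error 0 once the sample covers every row.
    Assumes n >= 2.
    """
    sample = rng.sample(values, n)
    fpc = math.sqrt(max(0.0, 1.0 - n / len(values)))
    return mean(sample), stdev(sample) / math.sqrt(n) * fpc


def bound_probability(model, est, se, delta, m=256):
    """Estimate Pr(|Y - y_hat| <= delta) with m quasi-Monte Carlo points.

    Assumption: the feature error is normal with mean est and std dev se.
    """
    if se == 0.0:
        return 1.0  # exact feature, so the prediction is exact
    y_hat = model(est)
    dist = NormalDist(est, se)
    hits = 0
    for i in range(1, m + 1):
        x = dist.inv_cdf(van_der_corput(i))  # plausible exact feature value
        if abs(model(x) - y_hat) <= delta:
            hits += 1
    return hits / m


def serve(model, values, delta, tau, n0=100, growth=2, seed=0):
    """Grow the sample until the estimated bound holds (hypothetical planner).

    Returns (prediction, rows used, estimated probability).
    """
    rng = random.Random(seed)
    n = min(n0, len(values))
    while True:
        est, se = approx_mean(values, n, rng)
        p = bound_probability(model, est, se, delta)
        if p >= tau:
            return model(est), n, p
        n = min(n * growth, len(values))


if __name__ == "__main__":
    rows = [float(i % 100) for i in range(20000)]
    y_hat, used, p = serve(lambda x: 2 * x + 1, rows, delta=2.0, tau=0.95)
    print(f"prediction={y_hat:.2f} rows_used={used} estimated_prob={p:.3f}")
```

The loop always terminates: once the sample covers all rows the standard error is zero and the estimated probability is 1. A model that is less sensitive to the feature satisfies the bound with fewer rows, which is the resilience the paper exploits.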

Abstract

Machine learning inference pipelines commonly encountered in data science and industries often require real-time responsiveness due to their user-facing nature. However, meeting this requirement becomes particularly challenging when certain input features require aggregating a large volume of data online. Recent literature on interpretable machine learning reveals that most machine learning models exhibit a notable degree of resilience to variations in input. This suggests that machine learning models can effectively accommodate approximate input features with minimal discernible impact on accuracy. In this paper, we introduce Biathlon, a novel ML serving system that leverages the inherent resilience of models and determines the optimal degree of approximation for each aggregation feature. This approach enables maximum speedup while ensuring a guaranteed bound on accuracy loss. We evaluate Biathlon on real pipelines from both industry applications and data science competitions, demonstrating its ability to meet real-time latency requirements by achieving 5.3x to 16.6x speedup with almost no accuracy loss.

Paper Structure (19 sections, 10 equations, 14 figures, 2 tables)

Introduction
Background
Approximate Query Processing
Feature Importance
Biathlon
Workflow of Biathlon
Approximate Feature Computation (AFC)
Approximate Model Inference (AMI)
Planner
Evaluation
End to End Performance
Varying the confidence level $\tau$
Varying the error bound $\delta$
Related Work
Conclusion and Future work
...and 4 more sections

Figures (14)

Figure 1: Decision tree example
Figure 2: A (simplified) inference pipeline from Kaggle
Figure 3: System Overview of Biathlon
Figure 4: Latency and Accuracy (default configuration)
Figure 5: Latency Breakdown of Biathlon
...and 9 more figures
