Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines

Chaokun Chang, Eric Lo, Chunxiao Ye

TL;DR

Biathlon targets real-time ML inference pipelines in which expensive online aggregations for feature preparation make latency requirements hard to meet. It presents an online, plan-based system that combines Approximate Query Processing, uncertainty propagation via quasi-Monte Carlo, and Sobol-based feature importance to choose a per-feature approximation level while enforcing the probabilistic accuracy bound $Pr(|Y-\hat{y}|\le\delta)\ge\tau$. The approach reduces latency by approximately 5.3× to 16.6× on seven real pipelines with minimal accuracy loss, and the accuracy guarantee is tunable via $\tau$ and $\delta$, with stricter settings costing additional computation. This work demonstrates a practical way to accelerate ML inference pipelines by exploiting model resilience and online feature approximation, and it is potentially complementary to feature-store techniques for handling stale or heavy features.
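To make the bound $Pr(|Y-\hat{y}|\le\delta)\ge\tau$ concrete, the following is a minimal sketch of how such a probability could be estimated with quasi-Monte Carlo sampling. It is not Biathlon's implementation: it assumes each approximate feature carries an independent normal error (a mean and a standard deviation, as a sampling-based aggregate might report), and it uses a hand-written Halton sequence as the low-discrepancy generator. The names `van_der_corput`, `halton`, `accuracy_probability`, and `toy_model` are hypothetical and introduced only for illustration.

```python
from statistics import NormalDist

# First few primes serve as Halton bases, one per feature dimension.
PRIMES = (2, 3, 5, 7, 11, 13)


def van_der_corput(i, base):
    """Return the i-th element (i >= 1) of the van der Corput sequence."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        x += digit / denom
    return x


def halton(n_points, dim):
    """Return n_points low-discrepancy points in the open unit cube (0, 1)^dim."""
    return [
        [van_der_corput(i, PRIMES[d]) for d in range(dim)]
        for i in range(1, n_points + 1)
    ]


def accuracy_probability(model, means, stds, delta, n_points=1024):
    """Estimate Pr(|Y - y_hat| <= delta).

    y_hat is the prediction on the approximate feature values (means);
    Y is the prediction on features drawn from the assumed error
    distribution N(means[j], stds[j]^2), independently per feature.
    """
    y_hat = model(means)
    std_normal = NormalDist()
    hits = 0
    for u in halton(n_points, len(means)):
        # Map each uniform coordinate to a normal draw via the inverse CDF.
        x = [m + s * std_normal.inv_cdf(uj) for m, s, uj in zip(means, stds, u)]
        if abs(model(x) - y_hat) <= delta:
            hits += 1
    return hits / n_points


def toy_model(x):
    """Hypothetical model that is sensitive to x[0] and resilient to x[1]."""
    return 2.0 * x[0] + 0.1 * x[1]


if __name__ == "__main__":
    tau = 0.95
    p = accuracy_probability(toy_model, [1.0, 2.0], [0.05, 0.5], delta=0.3)
    print("estimated probability:", p, "meets tau:", p >= tau)
```

Under these assumptions, a planner could refine the most influential features (draw more samples, shrinking their standard deviations) and re-estimate until the probability reaches $\tau$; the toy model illustrates why a noisy but unimportant feature such as `x[1]` may need little refinement.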

Abstract

Machine learning inference pipelines commonly encountered in data science and industries often require real-time responsiveness due to their user-facing nature. However, meeting this requirement becomes particularly challenging when certain input features require aggregating a large volume of data online. Recent literature on interpretable machine learning reveals that most machine learning models exhibit a notable degree of resilience to variations in input. This suggests that machine learning models can effectively accommodate approximate input features with minimal discernible impact on accuracy. In this paper, we introduce Biathlon, a novel ML serving system that leverages the inherent resilience of models and determines the optimal degree of approximation for each aggregation feature. This approach enables maximum speedup while ensuring a guaranteed bound on accuracy loss. We evaluate Biathlon on real pipelines from both industry applications and data science competitions, demonstrating its ability to meet real-time latency requirements by achieving 5.3x to 16.6x speedup with almost no accuracy loss.

Paper Structure

This paper contains 19 sections, 10 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Decision tree example
  • Figure 2: A (simplified) inference pipeline from Kaggle
  • Figure 3: System Overview of Biathlon
  • Figure 4: Latency and Accuracy (default configuration)
  • Figure 5: Latency Breakdown of Biathlon
  • ...and 9 more figures