On the Cost of Model-Serving Frameworks: An Experimental Evaluation

Pasquale De Rosa, Yérom-David Bromberg, Pascal Felber, Djob Mvondo, Valerio Schiavoni

TL;DR

TensorFlow Serving outperforms all the other evaluated frameworks in serving deep learning (DL) models, and the DL-specific frameworks (TensorFlow Serving and TorchServe) display significantly lower latencies than the three general-purpose ML frameworks (BentoML, MLflow, and MLServer).

Abstract

In machine learning (ML), the inference phase is the process of applying pre-trained models to new, unseen data in order to make predictions. During the inference phase, end users interact with ML services to obtain insights, recommendations, or actions based on the input data. For this reason, serving strategies are crucial for deploying and managing models effectively in production environments. These strategies ensure that models are available, scalable, reliable, and performant for real-world applications such as time series forecasting, image classification, and natural language processing. In this paper, we evaluate the performance of five widely used model-serving frameworks (TensorFlow Serving, TorchServe, MLServer, MLflow, and BentoML) under four different scenarios (malware detection, cryptocoin price forecasting, image classification, and sentiment analysis). We demonstrate that TensorFlow Serving outperforms all the other frameworks in serving deep learning (DL) models. Moreover, we show that DL-specific frameworks (TensorFlow Serving and TorchServe) display significantly lower latencies than the three general-purpose ML frameworks (BentoML, MLflow, and MLServer).

Paper Structure

This paper contains 11 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Stacked percentile chart of the average inference time for the serving frameworks in Scenario 1.
  • Figure 2: Stacked percentile chart of the average inference time for the serving frameworks in Scenario 2.
  • Figure 3: Stacked percentile chart of the average inference time for the serving frameworks in Scenario 3.
  • Figure 4: Stacked percentile chart of the average inference time for the serving frameworks in Scenario 4.
  • Figure 5: Cumulative distribution function of the request turn-around time for the serving frameworks in Scenario 1.
  • ...and 2 more figures