MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms

Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue

TL;DR

This paper addresses the poor resource utilization and high cost that arise when deep learning inference services (DLISs) are deployed as a single function on serverless platforms. It proposes MOPAR, a model partitioning framework that first divides a DL model vertically into slices of similar layers, then splits slices containing resource dominant (RD) operators into sub-slices that can run in parallel, and uses data compression and shared memory to offset communication overhead between slices. Evaluated on 12 DL models in four categories on OpenFaaS and AWS Lambda, MOPAR improves resource efficiency by 27.62\% on average, reduces latency by about 5.52\%, and, based on Lambda's pricing, lowers cost by about 2.58$\times$.

Abstract

Given the elasticity and pay-as-you-go cost model of serverless platforms, deploying deep learning inference services (DLISs) on them is becoming a prevalent trend. However, when a DLIS is deployed as a single function, the varying resource requirements of different layers in the DL model hinder resource utilization and increase costs. To tackle this problem, we propose a model partitioning framework called MOPAR. This work builds on two resource usage patterns of DLISs, global differences and local similarity, which arise from the presence of resource dominant (RD) operators and from layer stacking. Considering these patterns, MOPAR adopts a hybrid approach that first divides the DL model vertically into multiple slices composed of similar layers to improve resource efficiency. Slices containing RD operators are further partitioned into multiple sub-slices, enabling parallel optimization to reduce inference latency. Moreover, MOPAR employs data compression and shared-memory techniques to offset the additional time introduced by communication between slices. We implement a prototype of MOPAR and evaluate its efficacy using 12 DL models in four categories on OpenFaaS and AWS Lambda. The experimental results show that MOPAR improves the resource efficiency of DLISs by 27.62\% on average, while reducing latency by about 5.52\%. Furthermore, based on Lambda's pricing, the cost of running DLISs is reduced by about 2.58$\times$ using MOPAR.
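
To make the idea of grouping similar layers into slices concrete, the following is a minimal sketch and not MOPAR's actual partitioning algorithm, which the abstract does not specify. It assumes a hypothetical per-layer memory profile (`layer_mem`, in MB, with made-up values) and a hypothetical relative `tolerance`, and it greedily starts a new slice whenever a layer's usage departs from the running mean of the current slice.

```python
from typing import List


def group_similar_layers(mem_mb: List[float], tolerance: float = 0.25) -> List[List[int]]:
    """Greedily group consecutive layers into slices.

    A layer joins the current slice if its memory use is within
    `tolerance` (relative) of the slice's running mean; otherwise it
    starts a new slice. This is an illustrative heuristic only.
    """
    slices: List[List[int]] = []
    current: List[int] = []
    total = 0.0
    for idx, usage in enumerate(mem_mb):
        if current:
            mean = total / len(current)
            if abs(usage - mean) > tolerance * mean:
                # Usage differs too much: close the slice and start a new one.
                slices.append(current)
                current, total = [], 0.0
        current.append(idx)
        total += usage
    if current:
        slices.append(current)
    return slices


# Hypothetical profile: three light layers, two heavy layers, one light layer.
layer_mem = [100.0, 110.0, 105.0, 400.0, 420.0, 90.0]
print(group_similar_layers(layer_mem))
```

With this profile, the heavy middle layers end up in their own slice, which is the kind of slice that could be sized separately from the lighter ones. A real partitioner would also need to weigh the communication cost between slices, which this sketch ignores.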
