
Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing

Mengfan Liu, Wei Wang, Chuan Wu

TL;DR

This work addresses the cost-efficiency of serving large Mixture-of-Experts (MoE) models on CPU-based serverless platforms, tackling skewed expert popularity and scatter-gather bottlenecks. It introduces a Bayesian optimization framework with a multi-dimensional epsilon-greedy search to jointly learn expert selections and distributed MoE deployment, including a Bayesian posterior for token-to-expert mapping, scalable scatter-gather designs, and an MIQCP-based deployment optimizer. Implemented on AWS Lambda with PyTorch and Optuna, the approach reduces billed cost by at least 75.67% vs. CPU clusters and by 43.41% vs. LambdaML while maintaining satisfactory throughput. The results demonstrate the practical viability of cost-efficient serverless MoE inference for large models without reliance on GPUs, enabling scalable, on-demand inference serving.

Abstract

With the advancement of serverless computing, running machine learning (ML) inference services on serverless platforms has been advocated for its labor-free scalability and cost effectiveness. Mixture-of-Experts (MoE) models, built from parallel expert networks, have become a dominant architecture for large models. Serving large MoE models on serverless platforms is potentially beneficial but remains underexplored, because cost-efficient deployment with performance guarantees must handle skewed expert popularity and the scatter-gather communication bottleneck in MoE model execution. We study optimized MoE model deployment and distributed inference serving on a serverless platform that effectively predicts expert selection, pipelines communication with model execution, and minimizes the overall billed cost of serving MoE models. Specifically, we propose a Bayesian optimization framework with multi-dimensional epsilon-greedy search to learn expert selections and the MoE deployment that achieves the optimal billed cost, including: 1) a Bayesian decision-making method for predicting expert popularity; 2) flexibly pipelined scatter-gather communication; and 3) an optimal model deployment algorithm for distributed MoE serving. Extensive experiments on AWS Lambda show that our designs reduce the billed cost of all MoE layers by at least 75.67% compared to CPU clusters while maintaining satisfactory inference throughput. Compared to LambdaML in serverless computing, our designs achieve 43.41% lower cost with a throughput decrease of at most 18.76%.
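
To make the "billed cost" objective concrete, the following is a minimal sketch of how a serverless bill can be estimated, assuming the usual AWS Lambda pricing structure of configured memory times execution duration (GB-seconds) plus a per-request charge. The names `FunctionRun` and `billed_cost`, and all prices and workload numbers, are hypothetical and chosen for illustration; this is not the paper's cost model or its MIQCP formulation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FunctionRun:
    """One serverless function hosting part of an MoE layer (hypothetical type)."""
    memory_mb: int     # configured memory of the function
    duration_s: float  # billed execution time of a single invocation
    invocations: int   # number of times the function was invoked


def billed_cost(runs: Iterable[FunctionRun],
                price_per_gb_s: float,
                price_per_request: float = 0.0) -> float:
    """Sum GB-second charges and per-request charges over all functions.

    Prices are parameters because they vary by region and architecture;
    look them up rather than relying on the illustrative values below.
    """
    total = 0.0
    for r in runs:
        gb_seconds = (r.memory_mb / 1024) * r.duration_s * r.invocations
        total += gb_seconds * price_per_gb_s
        total += r.invocations * price_per_request
    return total


if __name__ == "__main__":
    # Illustrative only: a popular expert handles many tokens per call and
    # runs longer; an unpopular expert with the same memory runs briefly.
    runs = [
        FunctionRun(memory_mb=3072, duration_s=1.8, invocations=100),
        FunctionRun(memory_mb=3072, duration_s=0.2, invocations=100),
    ]
    # Assumed placeholder prices, not current AWS prices.
    print(billed_cost(runs, price_per_gb_s=1.7e-5, price_per_request=2e-7))
```

Because memory is billed for the whole duration of each invocation, giving every expert the same memory configuration regardless of how many tokens it receives inflates the bill, which is one reason predicting expert popularity matters for deployment.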

Paper Structure (23 sections, 12 equations, 14 figures, 2 algorithms)


Figures (14)

  • Figure 1: Overview of MoE model deployment on a serverless platform.
  • Figure 2: Billed cost of all MoE layers and inference throughput of a GPT-2-based MoE model.
  • Figure 3: Number of tokens with token ID 10424 (from the Enwiki8 dataset) routed to different experts at the 2nd MoE layer in a BERT-based MoE model.
  • Figure 4: Billed cost of all MoE layers and end-to-end inference time of a BERT-based MoE model on AWS Lambda (tokens from the Enwiki8 dataset; payload size 6 MB).
  • Figure 5: System Overview.
  • ...and 9 more figures