Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
Mengfan Liu, Wei Wang, Chuan Wu
TL;DR
This work addresses the cost-efficiency of serving large Mixture-of-Experts (MoE) models on CPU-based serverless platforms, tackling skewed expert popularity and scatter-gather bottlenecks. It introduces a Bayesian optimization framework with a multi-dimensional epsilon-greedy search to jointly learn expert selections and distributed MoE deployment, including a Bayesian posterior for token-to-expert mapping, scalable scatter-gather designs, and an MIQCP-based deployment optimizer. Implemented on AWS Lambda with PyTorch and Optuna, the approach yields substantial billed-cost reductions (at least 75.67% vs. CPU clusters and 43.41% vs. LambdaML) while maintaining satisfactory throughput. The results demonstrate the practical viability of cost-efficient serverless MoE inference for large models without reliance on GPUs, enabling scalable, on-demand inference serving.
Abstract
With the advancement of serverless computing, running machine learning (ML) inference services on serverless platforms has been advocated for its labor-free scalability and cost effectiveness. Mixture-of-Experts (MoE) models, which comprise parallel expert networks, have become a dominant architecture for today's large models. Serving large MoE models on serverless computing is potentially beneficial but remains underexplored, because skewed expert popularity and the scatter-gather communication bottleneck in MoE execution make it hard to achieve cost-efficient deployment with performance guarantees. We study optimized MoE model deployment and distributed inference serving on a serverless platform that effectively predicts expert selection, pipelines communication with model execution, and minimizes the overall billed cost of serving MoE models. Specifically, we propose a Bayesian optimization framework with multi-dimensional epsilon-greedy search that learns expert selections and the MoE deployment minimizing billed cost. It includes: 1) a Bayesian decision-making method for predicting expert popularity; 2) flexibly pipelined scatter-gather communication; and 3) an optimal model deployment algorithm for distributed MoE serving. Extensive experiments on AWS Lambda show that our designs reduce the billed cost of all MoE layers by at least 75.67% compared to CPU clusters while maintaining satisfactory inference throughput. Compared to LambdaML in serverless computing, our designs achieve 43.41% lower cost with a throughput decrease of at most 18.76%.
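To make the link between skewed expert popularity and billed cost concrete, the following is a toy sketch, not the paper's cost model or optimizer. It assumes that a serverless function is billed in proportion to configured memory times execution time, and that per-token compute time shrinks sublinearly as memory grows. The rate `PRICE_PER_GB_SECOND`, the constants `BASE_SEC_PER_TOKEN` and `SCALING_EXPONENT`, the token counts, and all function and class names are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# All constants below are illustrative assumptions, not measured values.
PRICE_PER_GB_SECOND = 0.0000166667  # placeholder billing rate per GB-second
BASE_SEC_PER_TOKEN = 0.01           # assumed per-token time at 1 GB of memory
SCALING_EXPONENT = 0.5              # assumed sublinear speedup with memory


@dataclass
class ExpertFunction:
    """One expert hosted in its own serverless function (hypothetical layout)."""
    memory_gb: float  # memory configured for the function
    tokens: int       # tokens routed to this expert in one batch

    @property
    def duration_sec(self) -> float:
        # More memory means faster execution, but with diminishing returns.
        sec_per_token = BASE_SEC_PER_TOKEN / (self.memory_gb ** SCALING_EXPONENT)
        return self.tokens * sec_per_token

    @property
    def billed_cost(self) -> float:
        # Pay-per-use billing: memory (GB) x duration (s) x rate.
        return self.memory_gb * self.duration_sec * PRICE_PER_GB_SECOND


def layer_cost(functions):
    """Total billed cost of one MoE layer across all expert functions."""
    return sum(f.billed_cost for f in functions)


def layer_latency(functions):
    """Experts run in parallel, so the slowest one determines layer latency."""
    return max(f.duration_sec for f in functions)


# Skewed popularity: one hot expert receives most of the tokens.
tokens_per_expert = [700, 200, 50, 50]

# Plan A: every expert gets the same (large) memory size.
uniform = [ExpertFunction(4.0, t) for t in tokens_per_expert]

# Plan B: only the hot expert keeps the large memory size.
popularity_aware = [
    ExpertFunction(m, t) for m, t in zip([4.0, 1.0, 1.0, 1.0], tokens_per_expert)
]

for name, plan in [("uniform", uniform), ("popularity-aware", popularity_aware)]:
    print(f"{name}: cost={layer_cost(plan):.8f}, latency={layer_latency(plan):.2f}s")
```

Under these assumptions, the hot expert is the straggler in both plans, so shrinking the memory of the rarely selected experts lowers the billed cost without lengthening the layer. This is one reason an accurate prediction of expert popularity can pay off when choosing a deployment.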
