Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production

Young Jin Kim; Rawn Henry; Raffy Fahim; Hany Hassan Awadalla

TL;DR

The paper tackles the challenge of deploying large-scale Mixture of Experts (MoE) transformers for multilingual machine translation in production. It introduces a single-GPU inference framework that builds on FasterTransformer, adding MoE support, 4-bit and 8-bit expert weight quantization, dequantization fused into the GEMM kernels, and batch pruning to reduce memory traffic and latency. The approach yields up to 26x higher throughput and shrinks the model to almost one eighth of its 32-bit float size, so that 136x larger models can be deployed on GPU at 27% lower cost and with better quality than the CPU-based distilled models. This effectively removes the need to distill a teacher model into many smaller per-language or per-task models. The paper also outlines a path toward handling even larger models through distributed inference and further kernel optimizations.

Abstract

Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various natural language processing tasks including machine translation. However, it remains challenging to deploy such models in real-life scenarios due to the large memory requirements and inefficient inference. In this work, we introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models and cut down the memory consumption significantly. While we achieve up to 26x speed-up in terms of throughput, we also reduce the model size almost to one eighth of the original 32-bit float model by quantizing expert weights into 4-bit integers. As a result, we are able to deploy 136x larger models with 27% less cost and significantly better quality compared to the existing solutions. This enables a paradigm shift in deploying large-scale multilingual MoE transformer models, replacing the traditional practice of distilling teacher models into dozens of smaller models per language or task.
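
The size reduction from 4-bit expert quantization can be illustrated with a small sketch. The abstract does not spell out the quantization scheme, so everything below is an assumption made for illustration: a symmetric scheme with one float scale per weight row, integer values in [-8, 7], and two 4-bit values packed per byte. The function names quantize_int4 and dequantize_int4 are hypothetical and are not taken from the paper or from FasterTransformer.

```python
def quantize_int4(row):
    """Symmetric 4-bit quantization of one row of float weights (assumed scheme).

    Returns (scale, packed), where packed holds two 4-bit values per byte.
    """
    max_abs = max(abs(w) for w in row)
    # Map the largest magnitude to 7 so every value fits in [-7, 7].
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in row]
    if len(q) % 2:
        q.append(0)  # pad so the values pair up into whole bytes
    packed = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Keep the low 4 bits of each signed value (two's complement nibble).
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return scale, bytes(packed)


def dequantize_int4(scale, packed, n):
    """Unpack n 4-bit values and rescale them to floats."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            out.append(signed * scale)
    return out[:n]


row = [0.5, -1.25, 0.0, 2.0, -0.75]
scale, packed = quantize_int4(row)
restored = dequantize_int4(scale, packed, len(row))
```

Under these assumptions, a row of 1024 weights takes 4096 bytes in 32-bit float and 512 bytes packed, plus 4 bytes for a 32-bit scale, which is slightly more than one eighth of the original size. A production kernel would perform the unpacking on the GPU inside the matrix multiply rather than in a separate pass.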

Paper Structure (17 sections, 1 equation, 1 figure, 4 tables, 1 algorithm)

Introduction
Challenges and Contributions
MoE Inference challenge
Inference Optimization Contributions
FasterTransformer overview
MoE Inference Optimizations
Model architecture
Multilingual Machine Translation Model
Optimized GPU kernel design
Expert quantization with 4-bit and 8-bit
Quantization Optimization
Optimized 8-bit Dequantize
Optimized 4-bit Dequantize
MoE Batch Pruning
Results and discussion
...and 2 more sections

Figures (1)

Figure 1: The computation performed by CUTLASS Grouped GEMM. Each color is a sub-matrix for a particular expert, and the matrix multiplies for all experts happen in parallel. If the yellow sentence were already finished, batch pruning would omit it from the computation, which would completely remove the need to load the weight matrix for the yellow expert.
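
The batch pruning idea in the caption can be sketched on the CPU in a few lines. This is only an illustration under stated assumptions: top-1 routing (one expert per token), one token per sentence at the current decoding step, and plain Python lists in place of the GPU grouped GEMM. The function name moe_layer_with_batch_pruning and its arguments are hypothetical.

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def moe_layer_with_batch_pruning(tokens, expert_ids, finished, expert_weights):
    """Run one expert layer, skipping sentences that have finished decoding.

    tokens:         one input vector per sentence in the batch
    expert_ids:     expert chosen for each token (top-1 routing assumed)
    finished:       True where the sentence has already finished
    expert_weights: dict mapping expert id -> weight matrix
    Returns (outputs, loaded), where loaded lists the experts whose
    weights were actually read.
    """
    # Group the still-active tokens by expert; finished ones are pruned here.
    groups = {}
    for i, (expert, done) in enumerate(zip(expert_ids, finished)):
        if not done:
            groups.setdefault(expert, []).append(i)

    outputs = [None] * len(tokens)
    loaded = []
    for expert, indices in groups.items():
        weight = expert_weights[expert]  # read only for experts with live tokens
        loaded.append(expert)
        for i in indices:
            outputs[i] = matvec(weight, tokens[i])
    return outputs, sorted(loaded)
```

If every token routed to a given expert belongs to a finished sentence, that expert never appears in groups, so its weight matrix is never touched. This mirrors the caption's yellow-expert case; in a grouped GEMM the per-expert loop would run as parallel sub-problems rather than sequentially.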
