Scattered Mixture-of-Experts Implementation

Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville

TL;DR

ScatterMoE is a GPU-friendly Sparse Mixture-of-Experts (SMoE) implementation that reduces memory overhead by avoiding padding and by fusing grouping with the linear transform through a ParallelLinear primitive. Built on Triton, it provides an efficient SMoE MLP and extends to Mixture-of-Attention, achieving higher throughput and lower memory use than Megablocks in benchmarks. The improvements hold across unit and end-to-end benchmarks, and ParallelLinear offers a base for future SMoE variants. Decoding kernels and multi-node parallelism remain future work.

Abstract

We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon existing implementations and overcomes some of their limitations to improve inference and training speed and to reduce memory footprint. It achieves this by avoiding padding and by avoiding excessive copies of the input. We introduce ParallelLinear, the main component we use to build our implementation, and the various kernels used to speed up the operation. We benchmark our implementation against Megablocks and show that it enables higher throughput and a lower memory footprint. We also show how ParallelLinear enables extensions of the Mixture-of-Experts concept by demonstrating an implementation of Mixture of Attention.
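
To make the cost of padding concrete, the following is a minimal sketch that counts how many rows a grouped buffer would need with and without padding. It assumes, for illustration only, that a padded scheme rounds each expert's group up to a multiple of a fixed block size; the function name `grouped_rows`, the block size of 4, and the routing list are hypothetical and not taken from the paper.

```python
from collections import Counter


def grouped_rows(expert_ids, block_size=None):
    """Count the rows needed to hold tokens grouped by expert.

    expert_ids: one expert index per routed token (hypothetical routing).
    block_size: None means no padding; otherwise each expert's group is
                rounded up to a multiple of block_size (an assumed scheme).
    """
    counts = Counter(expert_ids)
    if block_size is None:
        return sum(counts.values())
    # Ceiling division per expert, then scale back up to whole blocks.
    return sum(-(-c // block_size) * block_size for c in counts.values())


# Hypothetical routing: 10 token-to-expert assignments over 4 experts.
expert_ids = [0, 2, 2, 1, 0, 3, 2, 2, 1, 2]
unpadded = grouped_rows(expert_ids)
padded = grouped_rows(expert_ids, block_size=4)
print(unpadded, padded)
```

Under these assumptions the padded buffer needs twice as many rows as the unpadded one, because experts that receive few tokens still occupy a full block. The actual overhead of any real implementation depends on its block size and on how balanced the routing is.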

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, 1 table, 4 algorithms.

Figures (8)

  • Figure 1: Current implementations of SMoE Multi-layer Perceptrons (MLPs) require a copy of the embeddings when grouping (left), while ScatterMoE fuses the grouping and linear transformation step (right), reducing the memory footprint of our method. The various colours represent different experts, while the vertical rectangular boxes represent embeddings with their associated time steps labelled above or below them.
  • Figure 2: ParallelLinear supports different combinations of SMoE transformations, in which the input and the output can each be either grouped or scattered. This basic functionality forms the basis of both the forward and backward passes of ScatterMoE. Unlike existing implementations, these operations are performed without additional copying (or padding) of the input and output tensors.
  • Figure 3: ParallelLinear allows for scattered-to-scattered transformations, which retain the chronological order.
  • Figure 4:
  • Figure 5: Increasing $k$ and $E$ while fixing the number of active parameters and total parameters. We find that our implementation scales better with higher $k$. Inference granularity scaling performance. The difference in relative throughput is higher if we consider only the forward pass.
  • ...and 3 more figures
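
The grouped and scattered combinations described in the Figure 2 and Figure 3 captions can be illustrated with a small reference sketch in plain Python. This is not the Triton kernel from the paper: the names `parallel_linear` and `matvec`, the flags `grouped_in` and `grouped_out`, and the toy weights are assumptions made for illustration. The point it shows is that rows are read and written through an index, so no separate grouped copy of the input is built.

```python
def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in x d_out)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def parallel_linear(x, weights, expert_ids, grouped_in=False, grouped_out=False):
    """Apply weights[e] to every row routed to expert e (reference sketch).

    x:           list of rows, in scattered (original) order unless grouped_in.
    expert_ids:  expert index per row, always given in scattered order.
    grouped_in:  True if the rows of x are already sorted by expert.
    grouped_out: True to return rows sorted by expert instead of original order.
    """
    n = len(expert_ids)
    # Stable sort: order[g] is the scattered position of the g-th grouped row.
    order = sorted(range(n), key=lambda i: expert_ids[i])
    out = [None] * n
    for g, s in enumerate(order):
        row = x[g] if grouped_in else x[s]  # indexed read, no grouped copy
        out[g if grouped_out else s] = matvec(row, weights[expert_ids[s]])
    return out


# Toy setup: expert 0 is the identity, expert 1 doubles its input.
weights = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
x = [[1, 0], [0, 1], [1, 1]]
expert_ids = [1, 0, 1]

scattered = parallel_linear(x, weights, expert_ids)
grouped = parallel_linear(x, weights, expert_ids, grouped_out=True)
restored = parallel_linear(grouped, weights, expert_ids, grouped_in=True)
print(scattered, grouped, restored)
```

One way to chain two such transforms, as in an MLP, is scattered-to-grouped followed by grouped-to-scattered, which is what the `grouped` and `restored` calls do: the final rows come back in the original token order. A GPU implementation would process each expert's rows as tiles rather than one row at a time.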