Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

Yinghan Li, Yifei Li, Jiejing Zhang, Bujiao Chen, Xiaotong Chen, Lian Duan, Yejun Jin, Zheng Li, Xuanyu Liu, Haoyu Wang, Wente Wang, Yajie Wang, Jiacheng Yang, Peiyang Zhang, Laiwen Zheng, Wenyuan Yu

TL;DR

The paper addresses the efficient execution of irregular workloads on GPUs with a general static batching framework that maps multiple tasks into a single kernel through a compressed task mapping. The framework is extended to handle empty tasks and applied to Mixture-of-Experts (MoE) model inference, where token-index arrays and targeted GEMM optimizations are used to maximize throughput. The resulting MoE kernel reaches up to about 95% of peak Tensor Core throughput on NVIDIA Hopper H20 and about 91% on H800 in favorable cases, with robust performance across balanced, best, and worst scenarios. The approach improves resource utilization for irregular workloads and makes large-scale MoE inference, such as LLM serving, more efficient.

Abstract

Arranging and executing irregular workloads on massively parallel devices has long been a problem. We propose a general framework for statically batching irregular workloads into a single GPU kernel using a runtime task mapping mechanism. We further apply this framework to Mixture-of-Experts (MoE) model inference and implement an optimized CUDA kernel. Our MoE kernel achieves up to 91% of the peak Tensor Core throughput on the NVIDIA H800 GPU and 95% on the NVIDIA H20 GPU.
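
The idea of mapping many irregular tasks onto the blocks of a single kernel launch can be illustrated with a small host-side sketch. This is not the paper's kernel or its actual mapping format: the prefix-sum layout, the binary search lookup, and the names `build_task_mapping` and `locate` are assumptions chosen for illustration. It only shows how a flat block index might be resolved to a task and a tile within that task, and how tasks with no work can be dropped from the mapping.

```python
from bisect import bisect_right
from itertools import accumulate


def build_task_mapping(tiles_per_task):
    """Build a compact mapping that skips empty tasks (hypothetical layout).

    Returns (task_ids, offsets), where offsets[i] is the first flat block
    index assigned to the i-th non-empty task and offsets[-1] is the total
    number of blocks to launch.
    """
    task_ids = [t for t, n in enumerate(tiles_per_task) if n > 0]
    counts = [tiles_per_task[t] for t in task_ids]
    offsets = [0] + list(accumulate(counts))
    return task_ids, offsets


def locate(block_idx, task_ids, offsets):
    """Map a flat block index to (task id, tile index within that task)."""
    slot = bisect_right(offsets, block_idx) - 1
    return task_ids[slot], block_idx - offsets[slot]


# Four tasks of unequal size; task 1 is empty (e.g. an expert with no tokens).
tiles_per_task = [3, 0, 1, 2]
task_ids, offsets = build_task_mapping(tiles_per_task)

# Each "block" of the single launch looks up which task and tile it owns.
schedule = [locate(b, task_ids, offsets) for b in range(offsets[-1])]
print(schedule)
```

In a GPU setting the loop over `range(offsets[-1])` would correspond to the grid of thread blocks, with each block performing the lookup from its own block index; here it runs sequentially on the host purely to show the indexing.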

Paper Structure

This paper contains 15 sections, 1 table, 4 algorithms.