Adding MFMA Support to gem5
Marco Kurzynski, Matthew D. Sinclair
TL;DR
This paper adds Matrix Core Engine (MCE) and Matrix Fused Multiply Add (MFMA) support to the gem5 GPU model for AMD MI200 and MI300 GPUs, enabling accurate simulation of the timing and behavior of MFMA workloads. The authors model MFMA execution as dedicated functional units (FUs), one per SIMD unit, giving 4 MCEs per compute unit (CU). They validate the timing against real hardware and achieve mean absolute percentage errors of around 1.3–1.5% across the tested instructions. They also provide a what-if analysis capability through a configurable --mfma-scale parameter, which lets users explore how changes in MFMA latency affect ML workloads. The results demonstrate that the gem5 MFMA implementation is high fidelity and show how researchers can use it for rapid experimentation with modern ML workloads on simulated future systems.
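For readers unfamiliar with the accuracy metric quoted above, the following is a minimal sketch of how a mean absolute percentage error (MAPE) between hardware and simulated cycle counts could be computed. The function name `mape` and the cycle counts are hypothetical and chosen purely for illustration; they are not measurements from the paper, and this is not the authors' validation script.

```python
def mape(measured, simulated):
    """Mean absolute percentage error of simulated values relative to measured ones."""
    if len(measured) != len(simulated) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for hw, sim in zip(measured, simulated):
        if hw == 0:
            raise ValueError("measured values must be non-zero")
        # Relative error of this sample, using the hardware value as the reference.
        total += abs(sim - hw) / abs(hw)
    return 100.0 * total / len(measured)


# Hypothetical cycle counts, made up for illustration only (not from the paper).
hardware_cycles = [100, 200, 400]
gem5_cycles = [101, 198, 404]

print(f"MAPE: {mape(hardware_cycles, gem5_cycles):.2f}%")
```

Each sample contributes its error relative to the hardware measurement, so instructions with very different latencies are weighted equally.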
Abstract
In this work, we enhance gem5's GPU model to support Matrix Core Engines (MCEs). Specifically, on the AMD MI200 and MI300 GPUs that gem5 supports, these MCEs execute Matrix Fused Multiply Add (MFMA) instructions at a variety of precisions. This support enables running state-of-the-art ML workloads in gem5, as well as examining how MCE optimizations impact the behavior of future systems.
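As a reminder of what an MFMA instruction computes, the sketch below shows the underlying arithmetic, D = A x B + C, on small matrices in plain Python. The function name `mfma` is hypothetical, and the sketch illustrates only the mathematical operation; it does not reflect gem5's implementation, the block sizes or precisions of any specific instruction, or any timing behavior.

```python
def mfma(a, b, c):
    """Return D = A x B + C for matrices given as lists of rows (illustrative only)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B must match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must have the same shape as A x B")
    # Each output element is the accumulator value plus a dot product.
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


# Small 2x2 example with made-up values.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
print(mfma(A, B, C))
```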
