Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A
Aaron Jarmusch, Connor Vitz, Sunita Chandrasekaran
TL;DR
This work presents an execution-centric characterization of FP8 matrix cores, ACE concurrency, and 2:4 structured sparsity on the AMD MI300A, using targeted microbenchmarks to reveal occupancy thresholds, overlap limits, and fairness under concurrency. It reports three main findings. First, FP8 performance is memory-latency-bound and requires high occupancy (e.g., 256+ wavefronts) to approach peak utilization. Second, ACE increases aggregate throughput but can severely degrade per-stream fairness. Third, 2:4 sparsity yields little benefit in isolation because of constant encoding overhead, but it can improve fairness and throughput under contention. These findings translate into practical guidance for occupancy-aware scheduling, co-scheduling strategies (occupancy fragmentation), and context-dependent sparsity enablement in transformer-style and mixed-precision workloads on MI300A-class nodes. Overall, the work highlights the need for execution-aware runtime decisions, beyond traditional peak-throughput metrics, to achieve predictable and efficient performance in unified HPC-AI workloads.
Abstract
The AMD MI300A APU integrates CDNA3 GPUs with high-bandwidth memory and advanced accelerator features: FP8 matrix cores, asynchronous compute engines (ACE), and 2:4 structured sparsity. These capabilities are increasingly relied upon by modern HPC and HPC-AI workloads, yet their execution characteristics and system-level implications remain insufficiently understood. In this paper, we present an execution-centric characterization of FP8 matrix execution, ACE concurrency, and structured sparsity on MI300A using targeted microbenchmarks. We quantify occupancy thresholds, fairness, throughput trade-offs under concurrent execution, and context-dependent sparsity benefits. We evaluate representative case studies (transformer-style, concurrent, and mixed-precision kernels) to show how these effects translate into application-level performance and predictability. Our results provide practical guidance for occupancy-aware scheduling, concurrency decisions, and sparsity enablement on MI300A-class unified nodes.
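For readers unfamiliar with the 2:4 pattern mentioned above, the following is a minimal illustrative sketch, not the paper's benchmark code and not the MI300A hardware encoding. It assumes the common definition of 2:4 structured sparsity (at most two nonzero values in every group of four consecutive elements) and a simple magnitude-based pruning rule; the helper names `prune_2_4` and `expand_2_4` are hypothetical.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude values in every group of 4 elements.

    Returns (values, indices): the kept values and their positions (0..3)
    inside each group. This is an illustrative layout, not a hardware format.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Rank positions by descending magnitude, keep two, restore order.
        by_magnitude = sorted(range(4), key=lambda i: -abs(group[i]))
        keep = sorted(by_magnitude[:2])
        values.extend(group[i] for i in keep)
        indices.extend(keep)
    return values, indices


def expand_2_4(values, indices):
    """Rebuild the dense row (with zeros) from the compressed form."""
    dense = [0.0] * (len(values) * 2)
    for k, (value, pos) in enumerate(zip(values, indices)):
        group = k // 2  # two kept values per group of four
        dense[group * 4 + pos] = value
    return dense


row = [0.1, -2.0, 0.3, 4.0, 1.0, 0.0, -0.5, 0.2]
values, indices = prune_2_4(row)
dense = expand_2_4(values, indices)
```

The compressed form stores half the values plus one small position index per kept value. That per-group index metadata is one plausible way to picture a fixed encoding cost that does not shrink with the amount of useful work, though the abstract does not specify how the overhead arises on MI300A.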
