Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A

Aaron Jarmusch; Connor Vitz; Sunita Chandrasekaran

TL;DR

This work delivers an execution-centric characterization of FP8 matrix cores, ACE concurrency, and 2:4 sparsity on the AMD MI300A, using targeted microbenchmarks to reveal occupancy thresholds, overlap limits, and fairness under concurrency. It demonstrates that FP8 performance is memory-latency-bound and requires high occupancy (e.g., 256+ wavefronts) to approach peak utilization, that ACE increases aggregate throughput but can severely degrade per-stream fairness, and that 2:4 sparsity yields little benefit in isolation due to constant encoding overhead but can improve fairness and throughput under contention. The findings translate to practical guidance for occupancy-aware scheduling, co-scheduling strategies (occupancy fragmentation), and context-dependent sparsity enablement in transformer-style and mixed-precision workloads on MI300A-class nodes. Overall, the work highlights the need for execution-aware runtime decisions beyond traditional peak-throughput metrics to achieve predictable and efficient performance in unified HPC–AI workloads.

Abstract

The AMD MI300A APU integrates CDNA3 GPUs with high-bandwidth memory and advanced accelerator features: FP8 matrix cores, asynchronous compute engines (ACE), and 2:4 structured sparsity. Modern HPC and HPC-AI workloads increasingly rely on these capabilities, yet their execution characteristics and system-level implications remain insufficiently understood. In this paper, we present an execution-centric characterization of FP8 matrix execution, ACE concurrency, and structured sparsity on MI300A using targeted microbenchmarks. We quantify occupancy thresholds, fairness, throughput trade-offs under concurrent execution, and context-dependent sparsity benefits. We evaluate representative case studies (transformer-style, concurrent, and mixed-precision kernels) to show how these effects translate into application-level performance and predictability. Our results provide practical guidance for occupancy-aware scheduling, concurrency decisions, and sparsity enablement on MI300A-class unified nodes.
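For readers unfamiliar with the 2:4 pattern named in the abstract, the sketch below illustrates it on a plain Python list: within every contiguous group of four values, at most two are nonzero, so a compressed form stores two values plus their positions per group. Magnitude-based selection and the function name `prune_2_4` are assumptions made for illustration; the paper's own pruning and encoding path is not described in this section.

```python
def prune_2_4(row):
    """Apply a 2:4 pattern to a flat list of floats.

    Illustrative only: keeps the two largest-magnitude values in each
    contiguous group of four (an assumed selection rule) and returns
    (dense, values, positions), where `values` and `positions` form a
    simple compressed representation with two entries per group.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")

    dense, values, positions = [], [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Rank positions by descending magnitude; ties go to the lower index.
        ranked = sorted(range(4), key=lambda i: (-abs(group[i]), i))
        keep = sorted(ranked[:2])
        values.extend(group[i] for i in keep)
        positions.extend(keep)
        dense.extend(group[i] if i in keep else 0.0 for i in range(4))
    return dense, values, positions


if __name__ == "__main__":
    dense, values, positions = prune_2_4([0.1, -2.0, 0.3, 4.0, 1.0, 1.0, 1.0, 1.0])
    print(dense)
    print(values, positions)
```

The compressed form halves the stored values but adds per-group position metadata, which is the kind of fixed encoding cost the TL;DR refers to when it says sparsity yields little benefit in isolation.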

Paper Structure (33 sections, 16 figures, 3 tables)

Introduction
Background and Motivation
Related Work
Experimental Methodology
System Configuration
Microbenchmark Design and Measurement
FP8 Matrix Core Characterization
Microbenchmark Design
Throughput Scaling and Occupancy Sensitivity
Matrix Shape Effects
MFMA Opcode Coverage and Baseline Latency
Asynchronous Compute Engine Characterization
Execution Behavior and Contention Effects
Resource Contention and Scheduling Variance
Occupancy Fragmentation Effects
...and 18 more sections

Figures (16)

Figure 1: MI300A architecture: FP8 matrix cores and ACE.
Figure 2: Throughput versus total active wavefronts (one per block), normalized to peak, for FP64, FP32, FP16, BF16, and FP8. Higher is better; dashed line at 100% would indicate peak utilization. Experiments on MI300A (see the Experimental Methodology section for software versions).
Figure 3: Absolute throughput (GFLOPS) versus matrix aspect ratio for FP64, FP32, FP16, BF16, and FP8 at fixed total blocks. Unlike Figure 2, this figure uses raw GFLOPS so that shape sensitivity is comparable across precisions. Higher is better.
Figure 4: Speedup versus number of concurrent streams for FP32, FP16, and FP8 GEMM (512³, no contention).
Figure 5: Fairness and overlap characterization. (a) Overlap efficiency versus fairness across precisions and stream counts. (b) Contention sweep: overlap efficiency and fairness versus contention level for FP32 GEMM at four concurrent streams.
...and 11 more figures
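
The captions for Figures 4 and 5 refer to speedup, overlap efficiency, and fairness under concurrent streams, but this section does not give the paper's definitions. The sketch below shows one common way such quantities are computed, purely as an assumption: Jain's fairness index over per-stream throughputs, and overlap efficiency as aggregate speedup divided by stream count. The function names and both formulas are illustrative and may differ from the metrics the authors use.

```python
def jain_fairness(throughputs):
    """Jain's fairness index: 1.0 when all streams get equal throughput,
    1/n when a single stream receives everything. Assumed metric."""
    squares = sum(t * t for t in throughputs)
    if not throughputs or squares == 0:
        raise ValueError("need at least one nonzero throughput")
    total = sum(throughputs)
    return (total * total) / (len(throughputs) * squares)


def concurrent_speedup(serial_times, concurrent_wall_time):
    """Time to run the kernels back to back divided by the wall time
    observed when they run concurrently."""
    if concurrent_wall_time <= 0:
        raise ValueError("wall time must be positive")
    return sum(serial_times) / concurrent_wall_time


def overlap_efficiency(serial_times, concurrent_wall_time):
    """Speedup normalized by stream count (assumed definition):
    1.0 means perfect overlap, 1/n means no overlap at all."""
    return concurrent_speedup(serial_times, concurrent_wall_time) / len(serial_times)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the paper.
    per_stream = [4.0, 1.0, 1.0, 1.0]
    print(jain_fairness(per_stream))
    print(overlap_efficiency([1.0, 1.0, 1.0, 1.0], 2.0))
```

Reporting both numbers together matters because, as the TL;DR notes, aggregate throughput can rise while individual streams are starved; a high speedup with a low fairness index would indicate exactly that situation.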
