Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A

Aaron Jarmusch, Connor Vitz, Sunita Chandrasekaran

TL;DR

This work delivers an execution-centric characterization of FP8 matrix cores, ACE concurrency, and 2:4 sparsity on the AMD MI300A, using targeted microbenchmarks to reveal occupancy thresholds, overlap limits, and fairness under concurrency. It demonstrates that FP8 performance is memory-latency-bound and requires high occupancy (e.g., 256+ wavefronts) to approach peak utilization, that ACE increases aggregate throughput but can severely degrade per-stream fairness, and that 2:4 sparsity yields little benefit in isolation due to constant encoding overhead but can improve fairness and throughput under contention. The findings translate to practical guidance for occupancy-aware scheduling, co-scheduling strategies (occupancy fragmentation), and context-dependent sparsity enablement in transformer-style and mixed-precision workloads on MI300A-class nodes. Overall, the work highlights the need for execution-aware runtime decisions beyond traditional peak-throughput metrics to achieve predictable and efficient performance in unified HPC–AI workloads.

Abstract

The AMD MI300A APU integrates CDNA3 GPUs with high-bandwidth memory and advanced accelerator features: FP8 matrix cores, asynchronous compute engines (ACE), and 2:4 structured sparsity. Modern HPC and HPC-AI workloads increasingly rely on these capabilities, yet their execution characteristics and system-level implications remain insufficiently understood. In this paper, we present an execution-centric characterization of FP8 matrix execution, ACE concurrency, and structured sparsity on MI300A using targeted microbenchmarks. We quantify occupancy thresholds, fairness, throughput trade-offs under concurrent execution, and context-dependent sparsity benefits. We evaluate representative case studies (transformer-style, concurrent, and mixed-precision kernels) to show how these effects translate into application-level performance and predictability. Our results provide practical guidance for occupancy-aware scheduling, concurrency decisions, and sparsity enablement on MI300A-class unified nodes.
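For readers unfamiliar with the 2:4 pattern named in the abstract, the sketch below illustrates the general idea: in every group of four consecutive values, at most two are kept, and the kept values are stored together with their positions. This is a minimal illustration only. The function name `prune_2_of_4`, the magnitude-based selection rule, and the storage layout are assumptions made for this example; they do not describe the MI300A hardware encoding or the paper's microbenchmarks.

```python
def prune_2_of_4(row):
    """Apply a 2:4 pattern to a flat list of floats (illustrative only).

    Returns (dense, values, indices):
      dense   - same length as row, with two entries per group of four zeroed
      values  - the kept entries, two per group
      indices - position (0..3) of each kept entry inside its group
    Selection by largest magnitude is an assumption of this sketch.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")

    dense, values, indices = [], [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Pick the two largest-magnitude positions, then restore index order.
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])
        for i in range(4):
            dense.append(group[i] if i in keep else 0.0)
        values.extend(group[i] for i in keep)
        indices.extend(keep)
    return dense, values, indices


if __name__ == "__main__":
    dense, values, indices = prune_2_of_4([0.1, -2.0, 3.0, 0.5, 1.0, 0.25, -4.0, 0.0])
    print(dense)
    print(values)
    print(indices)
```

The compressed form always carries half the values plus one small index per kept value, regardless of matrix size, which is the kind of fixed metadata cost a sparse format has to amortize.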

Paper Structure

This paper contains 33 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: MI300A architecture: FP8 matrix cores and ACE.
  • Figure 2: Throughput versus total active wavefronts (one per block), normalized to peak, for FP64, FP32, FP16, BF16, and FP8. Higher is better; dashed line at 100% would indicate peak utilization. Experiments on MI300A (see the methodology section for software versions).
  • Figure 3: Absolute throughput (GFLOPS) versus matrix aspect ratio for FP64, FP32, FP16, BF16, and FP8 at fixed total blocks. Unlike Figure 2, this figure uses raw GFLOPS so that shape sensitivity is comparable across precisions. Higher is better.
  • Figure 4: Speedup versus number of concurrent streams for FP32, FP16, and FP8 GEMM (512³, no contention).
  • Figure 5: Fairness and overlap characterization. (a) Overlap efficiency versus fairness across precisions and stream counts. (b) Contention sweep: overlap efficiency and fairness versus contention level for FP32 GEMM at four concurrent streams.
  • ...and 11 more figures
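The captions for Figures 4 and 5 refer to speedup and fairness across concurrent streams without defining them in this summary. The sketch below shows one common way such quantities can be computed from per-stream timings. The definitions are assumptions chosen for illustration (speedup as serialized time over concurrent makespan, fairness as Jain's index over per-stream progress rates) and may differ from the paper's. The function name `concurrency_metrics` and its parameters are hypothetical.

```python
def concurrency_metrics(isolated_s, concurrent_s):
    """Compute (speedup, fairness) from per-stream timings in seconds.

    isolated_s[i]   - time of stream i when it runs alone
    concurrent_s[i] - completion time of stream i when all streams run together

    Assumed definitions (not taken from the paper):
      speedup  = sum(isolated) / max(concurrent)
      fairness = Jain's index over the rates isolated[i] / concurrent[i]
    """
    if not isolated_s or len(isolated_s) != len(concurrent_s):
        raise ValueError("need two non-empty timing lists of equal length")
    if min(isolated_s) <= 0 or min(concurrent_s) <= 0:
        raise ValueError("timings must be positive")

    # Running the streams back to back costs the sum of their isolated times.
    serialized = sum(isolated_s)
    # Running them together finishes when the slowest stream finishes.
    makespan = max(concurrent_s)
    speedup = serialized / makespan

    # A rate of 1.0 means the stream saw no slowdown from sharing the device.
    rates = [alone / shared for alone, shared in zip(isolated_s, concurrent_s)]
    # Jain's index: 1.0 when all rates are equal, 1/n when one stream dominates.
    fairness = sum(rates) ** 2 / (len(rates) * sum(r * r for r in rates))
    return speedup, fairness


if __name__ == "__main__":
    print(concurrency_metrics([1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]))
    print(concurrency_metrics([1.0, 1.0], [1.0, 4.0]))
```

Reporting both numbers together matters because they can move in opposite directions: aggregate speedup can rise while one stream absorbs most of the slowdown.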