Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension

Stefan Remke, Alexander Breuer

TL;DR

This paper presents an in-depth study of SME on Apple's M4 and a just-in-time code generator for SME-based small matrix multiplications. The generated kernels outperform the vendor-optimized BLAS implementation in almost all tested configurations.

Abstract

Modern central processing units (CPUs) feature single-instruction, multiple-data pipelines to accelerate compute-intensive floating-point and fixed-point workloads. Traditionally, these pipelines and the corresponding instruction set architectures (ISAs) were designed for vector parallelism. In recent years, major hardware vendors have further increased the throughput of their CPUs by introducing matrix units with corresponding ISA extensions. The Scalable Matrix Extension (SME) was announced for the Arm architecture in 2021, and Apple's M4 chip is the first to support it. This paper presents an in-depth study of SME on M4. Our microbenchmarks determine the maximum floating-point and fixed-point throughput of M4's SME acceleration and study the achievable bandwidth for transfers to and from the matrix registers. Furthermore, we use the insights gained to design a just-in-time code generator for SME-based small matrix multiplications. The results show that M4's SME support is FP32-centric, with an achievable throughput of over 2.3 FP32 TFLOPS. To maximize read and write bandwidth, loads into and stores from the matrix registers must be done in two steps. Our just-in-time generated small matrix multiplication kernels outperform the vendor-optimized BLAS implementation in almost all tested configurations.
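
To make the idea behind SME-based matrix multiplication concrete, the following minimal Python sketch emulates the outer-product-and-accumulate pattern that SME's FMOPA instruction provides: a product C = A·B is built as a sum of rank-1 updates, one per column of A and matching row of B. This is an illustration only, not the paper's code generator. The helper names `fmopa`, `matmul_outer`, and `flops_per_fmopa` are hypothetical, and the 512-bit streaming vector length used in the usage note is an assumption that the abstract does not state.

```python
def fmopa(za, a_col, b_row):
    """Emulate one outer-product-and-accumulate step on a tile.

    Adds a_col[i] * b_row[j] to za[i][j] for every i, j (in place).
    """
    for i, a in enumerate(a_col):
        row = za[i]
        for j, b in enumerate(b_row):
            row[j] += a * b


def matmul_outer(a, b):
    """Compute C = A @ B as a sum of K rank-1 (outer-product) updates.

    a is an M x K list of lists, b is a K x N list of lists.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]      # accumulator tile, starts at zero
    for p in range(k):
        a_col = [a[i][p] for i in range(m)]  # column p of A
        fmopa(c, a_col, b[p])                # row p of B
    return c


def flops_per_fmopa(svl_bits, elem_bits=32):
    """Floating-point operations of one non-widening outer product.

    Each of the lanes x lanes tile entries gets one multiply and one add.
    svl_bits is the (assumed) streaming vector length in bits.
    """
    lanes = svl_bits // elem_bits
    return 2 * lanes * lanes
```

For example, `matmul_outer([[1, 2], [3, 4]], [[5, 6], [7, 8]])` accumulates two outer products into a 2×2 tile. Under the assumed 512-bit vector length, `flops_per_fmopa(512)` counts 16 × 16 multiply-add pairs per FP32 instruction, which is why a single outer-product instruction can do far more work than a single vector FMA.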

Paper Structure (18 sections, 9 figures, 1 table)

Figures (9)

  • Figure 1: Comparison of multi-core performance for the FP32 Neon FMLA (vector) instruction and the FP32 FMOPA (non-widening) instruction. Performance is shown as the number of user-interactive threads increases.
  • Figure 2: Bandwidth of different strategies for loading data from memory into the ZA array. The LDR variant loads directly from memory into the ZA array, while the other strategies first load into one, two, or four vector registers (VR) and then copy the data into the ZA array. The loaded data is 128-byte aligned.
  • Figure 3: Bandwidth of different strategies for storing data from the ZA array to memory. The STR variant stores directly from the ZA array to memory, while the other strategies first copy to one, two, or four vector registers (VR) and then store the data from the vector register(s) to memory. The stored data is 128-byte aligned.
  • Figure 4: Bandwidth of different strategies for loading data from memory into the ZA array, considering memory alignment. Subfigures (a) - (d) show the different load variants, where the colors denote 16-byte, 32-byte, 64-byte, and 128-byte alignment of the data.
  • Figure 5: Bandwidth of different strategies for storing data from the ZA array into memory, considering memory alignment. Subfigures (a) - (d) show the different store variants, where the colors denote 16-byte, 32-byte, 64-byte, and 128-byte alignment of the data.
  • ...and 4 more figures