AutoTSMM: An Auto-tuning Framework for Building High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on CPUs

Chendi Li, Haipeng Jia, Hang Cao, Jianyu Yao, Boqian Shi, Chunyang Xiang, Jinbo Sun, Pengqi Lu, Yunquan Zhang

TL;DR

AutoTSMM presents a portable auto-tuning framework for high-performance tall-and-skinny matrix-matrix multiplication on CPUs by combining a pre-pack TSMM strategy with a runtime tiled algorithm and architecture-aware inner-kernel optimization. The install-time stage selects optimal inner kernels while the runtime stage builds a cache- and thread-aware execution plan, enabling data reuse and reduced packing overhead. Empirical results show AutoTSMM is competitive with state-of-the-art TSMM implementations (e.g., MKL-TSMM) and often outperforms conventional GEMM on both X86 and ARMv8, with substantial speedups, particularly when data reuse is high. The framework reduces manual tuning effort and offers portability across platforms, though gains degrade when data reuse opportunities are limited.

Abstract

In recent years, general matrix-matrix multiplication with non-regular-shaped input matrices has been widely used in many applications like deep learning and has drawn more and more attention. However, conventional implementations are not suited for non-regular-shaped matrix-matrix multiplications, and few works focus on optimizing tall-and-skinny matrix-matrix multiplication on CPUs. This paper proposes an auto-tuning framework, AutoTSMM, to build high-performance tall-and-skinny matrix-matrix multiplication. AutoTSMM selects the optimal inner kernels in the install-time stage and generates an execution plan for the pre-pack tall-and-skinny matrix-matrix multiplication in the runtime stage. Experiments demonstrate that AutoTSMM achieves performance competitive with state-of-the-art tall-and-skinny matrix-matrix multiplication implementations and outperforms all conventional matrix-matrix multiplication implementations.

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Overview of AutoTSMM
  • Figure 2: Tiled Algorithm for the Tall-and-Skinny Matrix-Matrix Multiplication. TSMM is transformed to GEPB (panel-block multiplication), where $m_t$ is the block height assigned to one thread and $k_c$ is the block width suited to the L2 cache size. Since n usually ranges from single digits to the hundreds and is significantly smaller than m and k, the n-dimensional tiling is not executed when $n \le n_c$. GEPB is then transformed to GEPB$_t$ (panel-block multiplication by threads), where $m_c$ is the block height suited to the L2 cache size. Finally, GEBB$_t$ (block-block multiplication by threads) is computed as a unit by the inner kernels.
  • Figure 3: Workload of The Pre-Pack Module
  • Figure 4: GEBB$_t$ Computed by Inner Kernels. The inner kernel performs a slice-times-slice matrix-matrix multiplication ($m_r$ and $n_r$ are the sizes suited to register blocking).
  • Figure 5: The Percentage of the Packing Operation Time in Conventional GEMM Implementation on X86 and ARMv8 CPUs
  • ...and 1 more figures
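
The tiling hierarchy described in the captions of Figures 2 and 4 can be illustrated with a small loop nest. This is only a sketch under stated assumptions, not the paper's implementation: the function names `tiled_tsmm` and `micro_kernel`, the default block sizes, and the exact loop order are hypothetical, the "threads" are simulated sequentially, pre-packing is omitted, and n is assumed small enough ($n \le n_c$) that no n-dimensional tiling is needed.

```python
def micro_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Hypothetical inner kernel: updates the (at most) m_r x n_r tile
    C[i0:i1, j0:j1] using the k-slice p0:p1 of A and B."""
    for i in range(i0, i1):
        for j in range(j0, j1):
            acc = 0.0
            for p in range(p0, p1):
                acc += A[i][p] * B[p][j]
            C[i][j] += acc


def tiled_tsmm(A, B, m_t=8, k_c=4, m_c=4, m_r=2, n_r=2):
    """Compute C = A * B for A (m x k) and a skinny B (k x n).

    Block sizes are illustrative defaults, not tuned values.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    # Row panels of height m_t: each would be assigned to one thread.
    for i_t in range(0, m, m_t):
        i_t_end = min(i_t + m_t, m)
        # Chunks of width k_c along k (cache blocking).
        for p in range(0, k, k_c):
            p_end = min(p + k_c, k)
            # Blocks of height m_c inside the thread's panel.
            for i_c in range(i_t, i_t_end, m_c):
                i_c_end = min(i_c + m_c, i_t_end)
                # Register-level tiles of size m_r x n_r.
                for i_r in range(i_c, i_c_end, m_r):
                    i_r_end = min(i_r + m_r, i_c_end)
                    for j_r in range(0, n, n_r):
                        j_r_end = min(j_r + n_r, n)
                        micro_kernel(A, B, C, i_r, i_r_end,
                                     j_r, j_r_end, p, p_end)
    return C
```

Every element of C is touched once per k-chunk, so the result matches a naive triple loop. In a tuned implementation the block sizes would be chosen from cache and register capacities rather than fixed as above.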