AutoTSMM: An Auto-tuning Framework for Building High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on CPUs
Chendi Li, Haipeng Jia, Hang Cao, Jianyu Yao, Boqian Shi, Chunyang Xiang, Jinbo Sun, Pengqi Lu, Yunquan Zhang
TL;DR
AutoTSMM is a portable auto-tuning framework for high-performance tall-and-skinny matrix-matrix multiplication (TSMM) on CPUs. It combines a pre-pack TSMM strategy with a runtime tiled algorithm and architecture-aware inner-kernel optimization. The install-time stage selects the optimal inner kernels, while the runtime stage builds a cache- and thread-aware execution plan that enables data reuse and reduces packing overhead. Empirical results show that AutoTSMM is competitive with state-of-the-art TSMM implementations (e.g., MKL-TSMM) and often outperforms conventional GEMM on both X86 and ARMv8, with substantial speedups when data reuse is high. The framework reduces manual tuning effort and is portable across platforms, though its gains degrade when data reuse opportunities are limited.
Abstract
In recent years, general matrix-matrix multiplication with non-regular-shaped input matrices has been widely used in applications such as deep learning and has drawn increasing attention. However, conventional implementations are not well suited to non-regular-shaped matrix-matrix multiplication, and few works focus on optimizing tall-and-skinny matrix-matrix multiplication on CPUs. This paper proposes an auto-tuning framework, AutoTSMM, to build high-performance tall-and-skinny matrix-matrix multiplication. AutoTSMM selects the optimal inner kernels in the install-time stage and generates an execution plan for the pre-pack tall-and-skinny matrix-matrix multiplication in the runtime stage. Experiments demonstrate that AutoTSMM achieves competitive performance compared to state-of-the-art tall-and-skinny matrix-matrix multiplication implementations. It also outperforms all conventional matrix-matrix multiplication implementations.
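The pre-pack idea mentioned above can be illustrated with a minimal sketch. This is not AutoTSMM's implementation: the function names `pack_panels` and `matmul_prepacked`, the choice of which operand is packed, and the panel width `nr` are all assumptions made for illustration. The sketch only shows why packing one operand once and reusing it across several multiplications removes repeated packing work.

```python
def pack_panels(b, nr):
    """Copy a K x N row-major matrix into contiguous column panels of width nr.

    Hypothetical helper: the panel layout is an illustrative assumption,
    not the layout used by AutoTSMM.
    """
    k, n = len(b), len(b[0])
    panels = []
    for j0 in range(0, n, nr):
        width = min(nr, n - j0)  # the last panel may be narrower than nr
        # Store the K rows of this panel back to back so a kernel can
        # read the panel sequentially.
        panel = [b[p][j0 + j] for p in range(k) for j in range(width)]
        panels.append((j0, width, panel))
    return panels


def matmul_prepacked(a, panels, n):
    """Compute C = A * B, where B is given as panels from pack_panels."""
    m, k = len(a), len(a[0])
    c = [[0.0] * n for _ in range(m)]
    for j0, width, panel in panels:
        for i in range(m):
            row_a, row_c = a[i], c[i]
            for p in range(k):
                a_ip = row_a[p]
                base = p * width  # start of row p inside the packed panel
                for j in range(width):
                    row_c[j0 + j] += a_ip * panel[base + j]
    return c


# Pack B once, then reuse the packed copy for several A matrices.
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
packed_b = pack_panels(b, nr=2)
for a in ([[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]):
    c = matmul_prepacked(a, packed_b, n=3)
```

In this sketch the packing cost is paid once and amortized over every later call, which is consistent with the abstract's point that the benefit depends on how much the packed data is reused.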
