
LP-GEMM: Integrating Layout Propagation into GEMM Operations

César Guedes Carneiro, Lucas Alvarenga, Guido Araujo, Sandro Rigo

Abstract

In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show average speedups of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. We demonstrate the practicality of the approach beyond microbenchmarks by implementing a standalone C++ version of the Llama-3.2 inference path using exclusively BLAS-level GEMM calls. These results confirm that leveraging data layout propagation between operations can significantly boost performance.


Figures (7)

  • Figure 1: Comparison of OpenBLAS and LP-GEMM kernel execution of consecutive GEMM operations. The relative block sizes do not represent execution time. Weight packing is omitted for clarity.
  • Figure 2: Visual representation of the GotoBLAS approach gotoblas. (a) shows how matrices are tiled for the architecture's memory hierarchy, while (b) depicts how each block $A_t$ and $B_t$ is further tiled and packed for better use of the processor's registers during the micro-kernel's computation of $C_t$. Finally, (c) shows the final GotoBLAS GEMM algorithm.
  • Figure 3: Overview of sequential GEMM operations using different data layouts. The results are the same as using OpenBLAS with column-major or row-major, respectively.
  • Figure 4: Micro-kernel Layout.
  • Figure 5: Speedup of a single GEMM extracted from gemmbench using different state-of-the-art kernels compared to the three separate LP-GEMM kernels.
  • ...and 2 more figures