Look-Up mAI GeMM: Increasing AI GeMMs Performance by Nearly 2.5x via msGeMM

Saeed Maleki

TL;DR

Look-Up mAI GeMM introduces msGeMM, a two-phase GeMM algorithm that uses a small look-up table (LUT) to reduce the number of multiplications and additions when model weights are low-precision. The paper analyzes the algorithm's complexity and shows that, with a LUT depth of around 3, AI GeMM workloads can achieve speedups of about 2.5x, particularly for large matrices. The approach relies on hardware support for LUT-based accumulation at high throughput, which current GPUs lack. If such hardware were available, msGeMM could significantly accelerate training and inference for transformer-based models that use low-precision weights.

Abstract

AI models are increasing in size, and recent advances in the community have shown that, unlike HPC applications where double-precision datatypes are required, lower-precision datatypes such as fp8 or int4 are sufficient to reach the same model quality for both training and inference. Following these trends, GPU vendors such as NVIDIA and AMD have added hardware support for fp16, fp8 and int8 GeMM operations with exceptional performance via Tensor Cores. This paper proposes a new algorithm called msGeMM, which shows that AI models with low-precision datatypes can run with ~2.5x fewer multiplication and addition instructions. An efficient implementation of this algorithm requires special CUDA cores that can add elements from a small look-up table at the rate of Tensor Cores.
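The two-phase idea can be illustrated with a minimal pure-Python sketch for a single matrix-vector product. This is an illustration under stated assumptions, not the paper's implementation: the function names (`build_lut`, `lut_matvec`, `naive_matvec`), the int4-style value set, and the table layout (one table per block of `d` activations, indexed by a `d`-tuple of weight codes) are all hypothetical choices made here. Phase 1 precomputes, for each block of `d` activations, the partial dot product with every possible `d`-tuple of weight values; phase 2 then forms each output element using only table look-ups and additions.

```python
from itertools import product

def build_lut(x, values, d):
    """Phase 1 (assumed layout): for each block of d activations, tabulate
    the partial dot product with every possible d-tuple of weight codes."""
    assert len(x) % d == 0, "sketch assumes len(x) is a multiple of d"
    lut = []
    for b in range(len(x) // d):
        block = x[b * d:(b + 1) * d]
        table = {}
        for combo in product(range(len(values)), repeat=d):
            table[combo] = sum(values[c] * xi for c, xi in zip(combo, block))
        lut.append(table)
    return lut

def lut_matvec(codes, lut, d):
    """Phase 2: each output element is a sum of look-ups; no multiplications."""
    y = []
    for row in codes:
        acc = 0
        for b, table in enumerate(lut):
            acc += table[tuple(row[b * d:(b + 1) * d])]
        y.append(acc)
    return y

def naive_matvec(codes, values, x):
    """Reference: decode each weight code and do a plain multiply-add."""
    return [sum(values[c] * xi for c, xi in zip(row, x)) for row in codes]

# Hypothetical example: 16 representable weight values (int4-like), depth d = 3.
values = list(range(-8, 8))
codes = [[0, 15, 3, 8, 7, 1],      # each entry is an index into `values`
         [5, 5, 9, 12, 0, 2]]
x = [1, 2, 3, 4, 5, 6]
d = 3

y_lut = lut_matvec(codes, build_lut(x, values, d), d)
y_ref = naive_matvec(codes, values, x)
```

In this sketch, phase 2 performs one look-up and one addition per `d` weights instead of `d` multiply-adds, while phase 1 costs `len(values) ** d` table entries per block regardless of the number of matrix rows. That cost is shared across all rows, which is why the benefit grows with matrix size.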

Paper Structure (13 sections, 11 equations, 3 figures)

Figures (3)

  • Figure 1: Multiplication of an MLP matrix $M$ by an activation vector $x$ with an output vector $y$.
  • Figure 2: A 2D block of the look-up table $L$ for Figure 1 with $d=2$. Note that $L$ is actually a 3D table; this shows only a 2D block of it, obtained by fixing the last dimension.
  • Figure 3: Comparing the performance of the 2-phase algorithm against the naive computation of GeMM.