Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Alfredo Metere
TL;DR
Low-Rank GEMM is a production-ready framework that uses low-rank approximations to reduce the computational cost and memory footprint of large GEMMs while exploiting FP8 hardware acceleration. It adaptively selects the rank and the decomposition method, and it runs on Tensor Cores with FP8 storage and FP32 accumulation to balance throughput against numerical stability. In benchmarks on an RTX 4090 it achieves up to 378 TFLOPS and 75% memory savings. The crossover point, beyond which the low-rank approach is fastest, lies around $N \approx 10^4$, where memory bandwidth becomes the primary bottleneck. The work shows that near-bandwidth-limited performance is achievable for large matrix multiplications, which enables training and deploying larger models and improves inference throughput in real-world ML workloads.
Abstract
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75% memory savings and a $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting the decomposition method (SVD or randomized SVD) and the precision level based on matrix characteristics and available accelerators. Comprehensive benchmarking on the same GPU demonstrates that Low-Rank GEMM becomes the fastest approach for matrices with $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.
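To make the core idea concrete, the following is a minimal NumPy sketch of a low-rank matrix product built on a randomized SVD. It is an illustration under stated assumptions, not the paper's implementation: the function names `randomized_low_rank` and `low_rank_gemm`, the `oversample` parameter, and the fixed seed are hypothetical, the arithmetic is done in float64 on the CPU, and FP8 storage, Tensor Core kernels, and adaptive rank selection are all omitted. Each operand is factored as $A \approx U_A V_A$ with inner dimension $r$, and the factors are multiplied so that the only dense $n \times n$ work is the final step, which costs on the order of $n^2 r$ operations when the output is materialized.

```python
import numpy as np

def randomized_low_rank(A, rank, oversample=8, seed=0):
    """Return factors (U, V) with A ~= U @ V, using a randomized range finder.

    Hypothetical helper for illustration; `oversample` and `seed` are assumptions.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + oversample, m, n)
    # Sample the column space of A with a Gaussian test matrix.
    omega = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(A @ omega)           # m x k orthonormal basis
    # Project A onto the basis and take a small exact SVD.
    small = Q.T @ A                          # k x n
    Us, s, Vt = np.linalg.svd(small, full_matrices=False)
    U = (Q @ Us[:, :rank]) * s[:rank]        # m x rank, singular values folded in
    V = Vt[:rank]                            # rank x n
    return U, V

def low_rank_gemm(A, B, rank):
    """Approximate A @ B from rank-`rank` factors of both operands."""
    Ua, Va = randomized_low_rank(A, rank)
    Ub, Vb = randomized_low_rank(B, rank)
    core = Va @ Ub                           # rank x rank
    return Ua @ (core @ Vb)                  # dense m x n result

# Usage: operands that are exactly rank 16, so the approximation is near-exact.
rng = np.random.default_rng(1)
n, r = 512, 16
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
B = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
C = low_rank_gemm(A, B, rank=r)
rel_err = np.linalg.norm(C - A @ B) / np.linalg.norm(A @ B)
```

For operands that are only approximately low rank, the error of this sketch depends on how quickly the singular values decay, which is why a rank-selection step matters in practice. Keeping the result in factored form instead of forming `C` would avoid the dense final multiplication altogether.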
