LRAMM -- Low precision approximates GEMM via RSVD

Hongyaoxing Gu

TL;DR

LRAMM targets fast, accurate approximate matrix multiplication by fusing mixed-precision quantized GEMMs with RSVD-based low-rank decompositions. The method decomposes A and B via RSVD to rank-$r$ approximations, then composes a low-rank product with three quantized GEMMs, while a formal error analysis bounds quantization, RSVD, and interaction effects. The work provides time-complexity analysis, guidance on parameter selection, and extensive empirical evaluation across scales and distributions, demonstrating speedups with controllable accuracy, especially when input matrices exhibit low-rank structure. This approach offers practical implications for accelerating large-scale ML and scientific computing workloads on mixed-precision hardware.

Abstract

Accelerating matrix multiplication has long been an active research topic across many domains. For some applications, approximate matrix multiplication can deliver significant performance improvements with little loss of precision. In this paper, we propose LRAMM, a high-performance approximate matrix multiplication algorithm that combines mixed-precision quantized matrix multiplication with RSVD. It uses low-rank matrix decomposition to further improve efficiency while staying within the error range of low-precision matrix multiplication.
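To make the pipeline described in the TL;DR concrete, the following is a minimal NumPy sketch, not the paper's implementation. It makes several assumptions that this page does not confirm: the helper names (`rsvd`, `quantize`, `qgemm`, `lramm_sketch`) are hypothetical, the quantizer is a simple symmetric per-tensor scheme, the RSVD is a basic randomized range finder with oversampling, and the order of the three GEMMs is one plausible way to multiply the factors $U_A\Sigma_A$, $V_A^T$, $U_B\Sigma_B$, and $V_B^T$.

```python
import numpy as np

def rsvd(M, r, oversample=10, seed=0):
    """Randomized SVD via a Gaussian range finder; returns U (m x r), s (r,), Vt (r x n)."""
    rng = np.random.default_rng(seed)
    k = min(r + oversample, min(M.shape))
    omega = rng.standard_normal((M.shape[1], k))
    Q, _ = np.linalg.qr(M @ omega)            # orthonormal basis for the sampled range of M
    Ub, s, Vt = np.linalg.svd(Q.T @ M, full_matrices=False)
    U = Q @ Ub
    return U[:, :r], s[:r], Vt[:r, :]

def quantize(M, bits):
    """Symmetric per-tensor quantization (an assumed scheme); returns (int matrix, scale)."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(M))
    scale = amax / qmax if amax > 0 else 1.0
    Mq = np.clip(np.rint(M / scale), -qmax, qmax).astype(np.int64)
    return Mq, scale

def qgemm(X, Y, bits):
    """Quantize both operands, multiply in integers, then rescale to floating point."""
    Xq, sx = quantize(X, bits)
    Yq, sy = quantize(Y, bits)
    return (Xq @ Yq).astype(np.float64) * (sx * sy)

def lramm_sketch(A, B, r, bits=(8, 8, 8)):
    """Approximate A @ B from rank-r RSVD factors using three quantized GEMMs."""
    Ua, sa, Vat = rsvd(A, r, seed=0)
    Ub, sb, Vbt = rsvd(B, r, seed=1)
    core = qgemm(Vat, Ub * sb, bits[0])       # GEMM 1: (r x k) @ (k x r) -> r x r
    left = qgemm(Ua * sa, core, bits[1])      # GEMM 2: (m x r) @ (r x r) -> m x r
    return qgemm(left, Vbt, bits[2])          # GEMM 3: (m x r) @ (r x n) -> m x n

# Usage on inputs that are exactly rank 8, so the RSVD step itself is nearly lossless.
rng = np.random.default_rng(42)
A = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
B = rng.standard_normal((48, 8)) @ rng.standard_normal((8, 32))
C_ref = A @ B
C_approx = lramm_sketch(A, B, r=8)
rel_err = np.linalg.norm(C_approx - C_ref) / np.linalg.norm(C_ref)
print(f"relative Frobenius error: {rel_err:.3e}")
```

In this sketch the only large operands are the tall or wide factors with $r$ columns or rows, which is where the potential saving over a full-size GEMM comes from when $r$ is much smaller than the matrix dimensions. The integer products are emulated with `int64` here; real low-precision hardware kernels would behave differently in speed, though not in the arithmetic being illustrated.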

Paper Structure (23 sections, 69 equations, 5 figures, 1 table, 4 algorithms)

Introduction
Related works
Approximating matrix by SVD
Random SVD algorithms
Approximate Matrix Multiplication
Algorithm
Quantization GEMM
Mixed low precision RSVD AMM
Approximation Error Analysis
Time complexity analysis
Parameter tuning
Evaluation
Time proportion test
Precision test
Algorithm efficiency test
...and 8 more sections

Figures (5)

Figure 1: Using RSVD for low-rank approximation of images, where (a) is the original image and (b, c) are the full-size images using 8-bit and 4-bit quantization, respectively. (d-i) are images that use different approximation ranks and quantization bit-widths, denoted as Lowrank($d_1$, $d_2$, $r$), where $d_1$ and $d_2$ are the quantization bit-widths for the two matrices $U\Sigma$ and $V$, and $r$ is the rank of the low-rank approximation.
Figure 2: The proportion of operator execution time in LRAMM across different scales, where GEMM1$\sim$3 denote the running times of the three low-precision matrix multiplications in Algorithm 3, RSVD is the running time of the randomized SVD, and package is the running time of quantization operations and other matrix operations.
Figure 3: The relative error and the approximation rank of LRAMM with different parameters for matrices of various distributions. $DQ4$ and $DQ8$ denote full-size quantized matrix multiplication with 4-bit and 8-bit precision, respectively. $LRAMM(d_1d_2d_3)$ indicates that LRAMM uses low-precision quantization bit-widths $d_1$, $d_2$, $d_3$ for the three matrix multiplication steps. $LRAMM(float)$ indicates that the matrix multiplications in LRAMM are computed directly in single precision, without low-precision quantization.
Figure 4: The distribution of singular values of matrices drawn from different distributions, where the matrix dimensions are $100 \times 100$.
Figure 5: The speedup of LRAMM for different bit-widths and approximation ranks. The baseline is full-size quantized matrix multiplication using uint16; LRAMM-uint32 indicates that the matrix multiplications use 32-bit computation, and LRAMM-uint16 indicates that they use 16-bit computation.
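
As a rough illustration of the kind of comparison Figure 4 describes, the sketch below computes singular values of two $100 \times 100$ random matrices. The choice of a standard Gaussian and a uniform $[0, 1)$ distribution is an assumption made here for illustration, since this page does not list the distributions used in the paper, and `energy_rank` is a hypothetical helper.

```python
import numpy as np

def singular_values(M):
    """Singular values of M in descending order."""
    return np.linalg.svd(M, compute_uv=False)

def energy_rank(s, frac=0.9):
    """Smallest r whose leading singular values hold `frac` of the squared spectrum."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, frac) + 1)

rng = np.random.default_rng(0)
n = 100
s_gauss = singular_values(rng.standard_normal((n, n)))       # zero-mean entries
s_unif = singular_values(rng.uniform(0.0, 1.0, size=(n, n)))  # nonzero-mean entries

for name, s in (("gaussian", s_gauss), ("uniform", s_unif)):
    print(name, "s1/s2 =", s[0] / s[1], "rank for 90% energy =", energy_rank(s))
```

A matrix with nonzero-mean entries has one dominant singular value coming from its mean component, so it is closer to low rank than a zero-mean Gaussian matrix of the same size. This is the kind of structure under which a small approximation rank is expected to cost less accuracy.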
