LRAMM -- Low precision approximates GEMM via RSVD

Hongyaoxing Gu

TL;DR

LRAMM targets fast, accurate approximate matrix multiplication by fusing mixed-precision quantized GEMMs with RSVD-based low-rank decompositions. The method decomposes A and B via RSVD to rank-$r$ approximations, then composes a low-rank product with three quantized GEMMs, while a formal error analysis bounds quantization, RSVD, and interaction effects. The work provides time-complexity analysis, guidance on parameter selection, and extensive empirical evaluation across scales and distributions, demonstrating speedups with controllable accuracy, especially when input matrices exhibit low-rank structure. This approach offers practical implications for accelerating large-scale ML and scientific computing workloads on mixed-precision hardware.

Abstract

Accelerating matrix multiplication has long been an active research topic across many domains. For some applications, approximate matrix multiplication can deliver significant performance gains with little loss of precision. In this paper, we propose LRAMM, a high-performance approximate matrix multiplication algorithm that combines mixed-precision quantized matrix multiplication with randomized SVD (RSVD). By exploiting low-rank matrix decomposition, it further improves efficiency while staying within the error range of low-precision matrix multiplication.
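
The pipeline summarized above (rank-$r$ RSVD of both inputs, followed by three quantized GEMMs) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names (`rsvd`, `quantize`, `qgemm`, `lramm_sketch`), the symmetric per-matrix quantizer, the oversampling value, and the order in which the three products are formed are all assumptions made for illustration.

```python
import numpy as np

def rsvd(M, r, oversample=5, rng=None):
    """Basic randomized SVD (assumed variant): returns (U*Sigma, V^T) truncated to rank r."""
    rng = np.random.default_rng(rng)
    k = min(r + oversample, min(M.shape))
    omega = rng.standard_normal((M.shape[1], k))      # random test matrix
    Q, _ = np.linalg.qr(M @ omega)                    # orthonormal basis for the sampled range
    Ub, s, Vt = np.linalg.svd(Q.T @ M, full_matrices=False)
    U = Q @ Ub
    return U[:, :r] * s[:r], Vt[:r, :]

def quantize(M, bits):
    """Symmetric per-matrix quantization to signed integers (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(M))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.rint(M / scale), -qmax, qmax).astype(np.int64)
    return q, scale

def qgemm(X, Y, bits):
    """Quantize both operands, multiply in integers, rescale to floating point."""
    qx, sx = quantize(X, bits)
    qy, sy = quantize(Y, bits)
    return (qx @ qy).astype(np.float64) * (sx * sy)

def lramm_sketch(A, B, r, bits=(8, 8, 8), seed=0):
    """Approximate A @ B from rank-r factors using three quantized GEMMs."""
    US_a, Vt_a = rsvd(A, r, rng=seed)        # A ~ US_a @ Vt_a
    US_b, Vt_b = rsvd(B, r, rng=seed + 1)    # B ~ US_b @ Vt_b
    core = qgemm(Vt_a, US_b, bits[0])        # GEMM 1: (r x k)(k x r) -> r x r
    left = qgemm(US_a, core, bits[1])        # GEMM 2: (m x r)(r x r) -> m x r
    return qgemm(left, Vt_b, bits[2])        # GEMM 3: (m x r)(r x n) -> m x n

# Inputs that are exactly rank 8, the regime where low-rank structure helps.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
B = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
C = lramm_sketch(A, B, r=8)
rel_err = np.linalg.norm(C - A @ B) / np.linalg.norm(A @ B)
print(f"relative error: {rel_err:.4f}")
```

Under this assumed ordering, and excluding the cost of the two RSVDs, the three GEMMs take roughly $2kr^2 + 2mr^2 + 2mrn$ floating-point operations for $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, compared with $2mkn$ for the direct product, so the saving grows as $r$ shrinks relative to $k$.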

Paper Structure (23 sections, 69 equations, 5 figures, 1 table, 4 algorithms)

Figures (5)

  • Figure 1: Low-rank approximation of images using RSVD. (a) is the original image; (b) and (c) are the full-size images under 8-bit and 4-bit quantization, respectively. (d-i) use different approximation ranks and quantization bit-widths, denoted Lowrank($d_1$, $d_2$, $r$), where $d_1$ and $d_2$ are the quantization bit-widths for the two matrices $U\Sigma$ and $V$, and $r$ is the rank of the low-rank approximation.
  • Figure 2: The proportion of operator execution time in LRAMM at different scales. GEMM1$\sim$3 denote the running times of the three low-precision matrix multiplications in Algorithm 3, RSVD is the running time of the randomized SVD, and package is the running time of quantization and other matrix operations.
  • Figure 3: The relative error and approximation rank of LRAMM with different parameters for matrices drawn from various distributions. $DQ4$ and $DQ8$ denote full-size quantized matrix multiplication at 4-bit and 8-bit precision, respectively. $LRAMM(d_1d_2d_3)$ indicates that LRAMM uses quantization bit-widths $d_1$, $d_2$, $d_3$ for its three matrix multiplication steps. $LRAMM(float)$ indicates that the matrix multiplications in LRAMM are computed directly in single precision, without low-precision quantization.
  • Figure 4: The distribution of singular values of matrices drawn from different distributions, where the matrix dimensions are $100 \times 100$.
  • Figure 5: The speedup (acceleration ratio) of LRAMM at different bit-widths and approximation ranks. The baseline is full-size quantized matrix multiplication using uint16; LRAMM-uint32 indicates that the matrix multiplications use 32-bit computation, and LRAMM-uint16 that they use 16-bit computation.
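
The Lowrank($d_1$, $d_2$, $r$) notation from Figure 1 can be made concrete with a small sketch: truncate a matrix to rank $r$, quantize $U\Sigma$ to $d_1$ bits and $V$ to $d_2$ bits, and measure the relative reconstruction error. For brevity this uses an exact truncated SVD from NumPy in place of RSVD, and the quantizer and the function names `quantize_dequantize` and `lowrank` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def quantize_dequantize(M, bits):
    """Round M onto a symmetric signed grid with the given bit-width (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(M))
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.rint(M / scale), -qmax, qmax) * scale

def lowrank(X, d1, d2, r):
    """Rank-r approximation with U*Sigma stored at d1 bits and V^T at d2 bits."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # exact SVD stands in for RSVD here
    US = U[:, :r] * s[:r]
    return quantize_dequantize(US, d1) @ quantize_dequantize(Vt[:r, :], d2)

# A synthetic matrix of exact rank 10 stands in for the image data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 100))

errs = {}
for d1, d2, r in [(8, 8, 10), (4, 4, 10), (8, 8, 5)]:
    approx = lowrank(X, d1, d2, r)
    errs[(d1, d2, r)] = np.linalg.norm(X - approx) / np.linalg.norm(X)
    print(f"Lowrank({d1}, {d2}, {r}): relative error {errs[(d1, d2, r)]:.4f}")
```

On this synthetic input, lowering the bit-width or truncating below the true rank both increase the error, which is the trade-off the figure panels illustrate on images.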