A method of using RSVD in residual calculation of LowBit GEMM

Hongyaoxing Gu

TL;DR

The low-rank residuals quantized matrix multiplication (LRQMM) method introduces low-rank approximation into residual compensation for dense low-precision quantized matrix multiplication. It improves accuracy severalfold with only BLAS-2-level extra time overhead.

Abstract

Recent advances in hardware technology have opened up many possibilities for low-precision applications. However, low precision can introduce significant computational errors, which makes maintaining computational accuracy a considerable challenge. We propose the low-rank residuals quantized matrix multiplication (LRQMM) method, which introduces low-rank approximation into residual compensation for dense low-precision quantized matrix multiplication. It improves accuracy severalfold with only BLAS-2-level extra time overhead. Moreover, LRQMM is a completely data-free quantization method that requires no additional data for pre-training. It operates only on the low-precision GEMM operator, which makes it easy to combine with other methods. In our experiments, LRQMM reduces the error of direct quantized matrix multiplication by 1~2 orders of magnitude, and for larger matrix sizes the computational speed drops by only approximately 20%. In deep learning networks, LRQMM-4bit achieves 61.8% ImageNet Top-1 accuracy on ResNet-50, while direct quantization (Direct Quant) reaches only 8.3%.
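
To make the idea of low-rank residual compensation concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: the symmetric per-tensor quantizer, the plain randomized range finder, and the names `quantize`, `low_rank_factors`, and `lrq_matmul` are all assumptions made for illustration. Writing $A = \hat{A} + R_A$ and $B = \hat{B} + R_B$, the product expands to $\hat{A}\hat{B} + R_A\hat{B} + \hat{A}R_B + R_A R_B$; the sketch approximates $R_A$ and $R_B$ with rank-$r$ factors and drops the second-order term.

```python
import numpy as np

def quantize(x, bits=4):
    """Symmetric per-tensor quantization (an assumed scheme, for illustration)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def low_rank_factors(r_mat, rank, n_iter=2, rng=None):
    """Randomized range finder: returns (Q, W) such that r_mat ~= Q @ W."""
    rng = np.random.default_rng() if rng is None else rng
    y = r_mat @ rng.standard_normal((r_mat.shape[1], rank))
    for _ in range(n_iter):
        # Power iteration: re-orthonormalize, then apply (R R^T) to sharpen the basis.
        q, _ = np.linalg.qr(y)
        y = r_mat @ (r_mat.T @ q)
    q, _ = np.linalg.qr(y)          # orthonormal basis for the sampled range
    return q, q.T @ r_mat           # projection of the residual onto that basis

def lrq_matmul(a, b, bits=4, rank=16, rng=None):
    """Return (direct quantized product, product with low-rank residual correction)."""
    qa, sa = quantize(a, bits)
    qb, sb = quantize(b, bits)
    c_direct = (qa @ qb) * (sa * sb)      # integer GEMM, then dequantize
    a_hat, b_hat = qa * sa, qb * sb       # dequantized operands
    ua, va = low_rank_factors(a - a_hat, rank, rng=rng)   # R_A ~= ua @ va
    ub, vb = low_rank_factors(b - b_hat, rank, rng=rng)   # R_B ~= ub @ vb
    # First-order residual terms only; R_A @ R_B is dropped.
    # Parenthesization keeps every product thin (one dimension equals `rank`).
    correction = ua @ (va @ b_hat) + (a_hat @ ub) @ vb
    return c_direct, c_direct + correction

rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, (128, 128))
b = rng.uniform(-1.0, 1.0, (128, 128))
c_ref = a @ b
c_direct, c_lr = lrq_matmul(a, b, bits=4, rank=16, rng=rng)

def rel_err(c):
    """Relative Frobenius-norm error against the full-precision product."""
    return np.linalg.norm(c - c_ref) / np.linalg.norm(c_ref)

err_direct, err_lr = rel_err(c_direct), rel_err(c_lr)
print(f"direct quant: {err_direct:.4f}  with low-rank residual: {err_lr:.4f}")
```

How much the error drops depends on the chosen rank and on how quickly the singular values of the residuals decay, so this toy setup is not meant to reproduce the figures reported in the abstract.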

Paper Structure

This paper contains 18 sections, 35 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Distribution of singular values for matrices drawn from different distributions; the matrix dimensions are $100 \times 100$.
  • Figure 2: Illustration of the LRQMM process. We extracted a $32 \times 32$ output from a ResNet convolutional layer as the data source for visualization. $Q$ represents the quantization operation, and $\widetilde{Q}$ represents the dequantization operation.
  • Figure 3: (a) Accuracy of the algorithm under different approximation ranks, where the test matrix is a uniformly distributed matrix of size $200^3$. (b) Accuracy under different matrix scales.
  • Figure 4: Relative error of different quantization algorithms at each layer of deep learning networks.
  • Figure 5: (a) Time proportion of different parts of the algorithm, where PAKAGE accounts for the time needed for matrix addition, quantization, and other operations aside from the aforementioned three items. (b) Speedup of different quantization methods on the GPU, with the SGEMM provided by cuBLAS as the baseline.