Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

L. A. Torres, Carlos J. Barrios H, Yves Denneulin

TL;DR

The paper evaluates CPU and GPU performance and energy consumption for FP32 $N\times N$ GEMM across MKL, cuBLAS, and SYCL, spanning multiple generations of Intel CPUs and NVIDIA GPUs. It compares CUDA with Tensor Cores, AVX2/AVX512 intrinsics, OpenMP, and SYCL, measuring execution time and energy via PAPI and PERF on matrices from $32\times32$ to $8192\times8192$ with multiple repetitions. Key findings: MKL gives the fastest times on CPU and cuBLAS the fastest on GPU; enabling Tensor Cores yields the best GPU times but increases MSE, whereas cuBLAS without Tensor Cores loses little performance and is much more accurate; OpenMP and SYCL on CPU are the most accurate but slower. The results indicate that modern CPUs can come close to GPU performance for large GEMMs at the cost of higher power consumption, and that hardware characteristics (cores, clock, PCIe bandwidth, compute capability) strongly condition these outcomes, underscoring the need for re-evaluation on newer generations.
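
As a rough illustration of the measurement protocol summarized above (square GEMMs of increasing size, several repetitions per size), a minimal sketch follows. It is not the paper's benchmark: it uses a naive pure-Python triple loop rather than MKL, cuBLAS, or SYCL, it records wall-clock time only (no PAPI or PERF energy counters), and the names `naive_gemm` and `time_gemm`, the repetition count, and the small sizes are all assumptions chosen for illustration.

```python
import random
import statistics
import time


def naive_gemm(a, b, n):
    """Multiply two n x n matrices stored as lists of rows (plain triple loop)."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_a, row_c = a[i], c[i]
        for k in range(n):
            aik = row_a[k]
            row_b = b[k]
            for j in range(n):
                row_c[j] += aik * row_b[j]
    return c


def time_gemm(n, repetitions=3, seed=0):
    """Return (median seconds, GFLOP/s) for an n x n GEMM over several repetitions."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        naive_gemm(a, b, n)
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    flops = 2 * n ** 3  # conventional GEMM count: one multiply and one add per inner step
    return median, flops / median / 1e9


if __name__ == "__main__":
    # The paper goes up to 8192 x 8192; sizes are kept small here for pure Python.
    for n in (32, 64, 128):
        seconds, gflops = time_gemm(n)
        print(f"N={n:4d}  median={seconds:.4f} s  {gflops:.3f} GFLOP/s")
```

Reporting the median over repetitions is one common way to damp run-to-run noise; the paper's own aggregation method is not stated in this summary.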

Abstract

Matrix multiplication is fundamental to the backpropagation algorithm used to train deep neural network models. Libraries such as Intel's MKL and NVIDIA's cuBLAS implement optimized matrix multiplication techniques that increase performance and reduce computational cost. These techniques can also be implemented in CUDA, in SYCL, and in functions written with AVX2 and AVX512 instructions, which have lower performance but better precision. The study compares execution times and power consumption, measured with PAPI and PERF, and compares accuracy across different matrix sizes. Comparisons were made on third- and fourth-generation Intel CPUs and on NVIDIA V100 and A100 GPUs. On the CPU, the MKL library showed the best performance with a slight loss of precision, while the OpenMP and SYCL implementations showed the best accuracy at a cost in performance. On the GPU, cuBLAS with tensor cores had the best performance but at a cost in accuracy; cuBLAS without these specialized cores showed minimal performance loss and much higher accuracy. The data obtained on the different architectures showed that the CPU could achieve performance close to that of the GPU, with increased power consumption. These results depend on hardware specifications: for the CPU, the number of cores, clock frequency, and processor generation; for the GPU, the speed and bandwidth of the PCI bus and the device architecture (compute capability).
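
To make the accuracy comparison concrete, the sketch below computes the mean squared error (MSE) between a GEMM carried out with FP32 rounding and an FP64 reference. This is an illustrative assumption about how such an MSE can be measured, not the authors' code: FP32 is emulated in pure Python by rounding after every multiply and add via `struct`, and the helper names `to_f32`, `gemm`, and `mse` are hypothetical.

```python
import random
import struct


def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def gemm(a, b, n, rnd=lambda x: x):
    """n x n GEMM; `rnd` is applied after each multiply and add to emulate a precision."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc = rnd(acc + rnd(a[i][k] * b[k][j]))
            c[i][j] = acc
    return c


def mse(x, y, n):
    """Mean squared error between two n x n matrices."""
    return sum((x[i][j] - y[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


if __name__ == "__main__":
    n = 32
    rng = random.Random(0)
    # Inputs are rounded to FP32 first so both runs see identical operands.
    a = [[to_f32(rng.random()) for _ in range(n)] for _ in range(n)]
    b = [[to_f32(rng.random()) for _ in range(n)] for _ in range(n)]
    reference = gemm(a, b, n)            # FP64 accumulation
    low_precision = gemm(a, b, n, to_f32)  # FP32 rounding at every step
    print(f"MSE (FP32 vs FP64 reference), N={n}: {mse(low_precision, reference, n):.3e}")
```

The same `mse` helper could be applied to the output of any implementation (library call, intrinsics, or GPU kernel) against a higher-precision reference; which reference the paper uses is not specified in this summary.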

Paper Structure (9 sections, 9 figures, 5 tables)


Figures (9)

  • Figure 1: Comparison of CPU execution times and MSE. a. Intel Xeon Gold 6126 (48 cores @ 2.60 GHz) b. Intel Xeon Gold 6254 (72 cores @ 3.10 GHz) c. Intel Xeon Silver 4314 (64 cores @ 2.40 GHz) d. Intel Xeon Gold 5320 (104 cores @ 2.20 GHz) e. Intel Xeon Gold 5315Y (32 cores @ 3.20 GHz) f. Intel Xeon Platinum 8480+ (224 cores @ 2.0 GHz)
  • Figure 2: Comparison of the execution times of the two NVIDIA architectures evaluated. a. Intel Xeon Gold 6126 (48 cores @ 2.60 GHz) - Tesla V100-PCIE-32GB b. Intel Xeon Gold 5315Y (32 cores @ 3.20 GHz) - NVIDIA A100-PCIE-40GB
  • Figure 3: Comparison of the execution times of the different processors evaluated using the MKL library.
  • Figure 4: Comparison of the execution times of the different GPUs evaluated using the cuBLAS library.
  • Figure 5: Comparison between the best CPU and GPU execution times: Intel Xeon Platinum 8480+, Intel Xeon Gold 5320, and NVIDIA A100-PCIE-40GB GPU (MKL vs. cuBLAS).
  • ...and 4 more figures