pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations

Ruben Laso, Diego Krupitza, Sascha Hunold

TL;DR

This work introduces pSTL-Bench, a micro-benchmark suite for quantifying the scalability of the parallel C++ STL algorithms across backends and architectures, including multi-core CPUs and GPUs. By evaluating five representative kernels (find, for_each, reduce, inclusive_scan, sort) under varied thread counts, problem sizes, and a NUMA-aware memory allocator, the study reveals substantial cross-backend performance disparities, with Intel's compiler with TBB and NVIDIA's compiler often outperforming GCC-based implementations on larger workloads. Small problem sizes incur high parallel-launch overhead, while larger sizes unlock significant speedups; on GPUs, vectorization and data-transfer overheads markedly influence the results. The work demonstrates the practical value of pSTL-Bench for guiding compiler and backend choices, and motivates extending the suite to more algorithms and architectures to further study the performance portability of parallel STL implementations.

Abstract

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a specialized set of micro-benchmarks to assess the scalability of the parallel algorithms in the STL. By selecting different backends, our micro-benchmarks can be used on multi-core systems and GPUs. Using the suite, in a case study on AMD and Intel CPUs and NVIDIA GPUs, we were able to identify substantial performance disparities among different implementations, including GCC+TBB, GCC+HPX, Intel's compiler with TBB, or NVIDIA's compiler with OpenMP and CUDA.
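
To make the kind of measurement concrete, the following is a minimal sketch of timing the five kernels named above. It is not code from pSTL-Bench: the names `time_kernel`, `KernelResults`, and `run_kernels` are hypothetical, and the sequential overloads are used so that the sketch builds with any C++17 compiler without a parallel backend. Under the assumption that a backend such as TBB is available, the parallel form of each call passes an execution policy such as `std::execution::par` (from `<execution>`) as the first argument.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of pSTL-Bench): runs `kernel` once and
// returns the elapsed wall-clock time in seconds.
template <class Kernel>
double time_kernel(Kernel&& kernel) {
    const auto start = std::chrono::steady_clock::now();
    kernel();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical result record: kernel outputs (so the work is observable)
// and one timing per kernel, in the order find, for_each, reduce,
// inclusive_scan, sort.
struct KernelResults {
    bool found;
    long long sum;
    long long last_prefix;
    bool sorted;
    double seconds[5];
};

// Times the five kernels on n elements using the sequential overloads.
// Parallel variant (assumption: a backend such as TBB is linked), e.g.:
//   std::reduce(std::execution::par, data.begin(), data.end(), 0LL);
KernelResults run_kernels(std::size_t n) {
    std::vector<long long> data(n);
    std::iota(data.begin(), data.end(), 1LL);  // 1, 2, ..., n
    std::vector<long long> out(n);
    KernelResults r{};

    // find: look for the last element, so the whole range is scanned.
    r.seconds[0] = time_kernel([&] {
        r.found = std::find(data.begin(), data.end(),
                            static_cast<long long>(n)) != data.end();
    });
    // for_each: increment every element in place.
    r.seconds[1] = time_kernel([&] {
        std::for_each(data.begin(), data.end(), [](long long& x) { x += 1; });
    });
    // reduce: sum of all elements.
    r.seconds[2] = time_kernel([&] {
        r.sum = std::reduce(data.begin(), data.end(), 0LL);
    });
    // inclusive_scan: prefix sums written to `out`.
    r.seconds[3] = time_kernel([&] {
        std::inclusive_scan(data.begin(), data.end(), out.begin());
    });
    r.last_prefix = out.empty() ? 0 : out.back();
    // sort: reverse first so the input is not already ordered.
    std::reverse(data.begin(), data.end());
    r.seconds[4] = time_kernel([&] { std::sort(data.begin(), data.end()); });
    r.sorted = std::is_sorted(data.begin(), data.end());
    return r;
}
```

Calling `run_kernels` for a range of problem sizes, once with the sequential overloads and once with an execution policy, and dividing the two timings per kernel gives a speedup figure of the kind the study reports. A real benchmark would also repeat each measurement and control thread counts, which this sketch omits.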

Paper Structure (23 sections, 24 figures, 4 tables)

Figures (24)

  • Figure 1: Speedup when using custom parallel allocator with 32 threads and a problem size of $2^{30}$ elements in Hydra. Higher is better.
  • Figure 2: Execution time scalability with the problem size in Hydra. All cores were used except for GCC’s sequential implementation. Lower is better.
  • Figure 3: Speedup against GCC's sequential implementation for benchmark X::for_each. Higher is better.
  • Figure 4: Execution time scalability with the problem size. All cores used except for GCC’s seq. implementation. Benchmark X::inclusive_scan. Lower is better.
  • Figure 5: Speedup against GCC's seq. implementation. Problem size is $2^{30}$. Benchmark X::inclusive_scan. Higher is better.
  • ...and 19 more figures