Comparison of Vectorization Capabilities of Different Compilers for X86 and ARM CPUs

Nazmus Sakib, Tarun Prabhu, Nandakishore Santhi, John Shalf, Abdel-Hameed A. Badawy

TL;DR

The paper evaluates the automatic vectorization capabilities of GCC, Clang, ICX, and ACFL on x86 and ARM, using a version of the TSVC2 benchmark suite modified to better reflect real-world code. It analyzes vectorization rates and the relative performance of vectorized code across both architectures and all compilers, with per-compiler breakdowns and cross-platform comparisons. The results show that no single compiler consistently dominates vectorization across all loops and platforms; performance varies by loop pattern and hardware, and the ARM results also reveal differences in optimization strategy between compilers. The study offers practical guidance for writing portable, high-performance vectorized code and highlights areas where compiler heuristics could be tuned to specific domains for better results.

Abstract

Most modern processors contain vector units that simultaneously perform the same arithmetic operation over multiple sets of operands. The ability of compilers to automatically vectorize code is critical to effectively using these units. Understanding this capability is important for anyone writing compute-intensive, high-performance, and portable code. We tested the ability of several compilers to vectorize code on x86 and ARM. We used the TSVC2 suite, with modifications that made it more representative of real-world code. On x86, GCC reported 54% of the loops in the suite as having been vectorized, ICX reported 50%, and Clang, 46%. On ARM, GCC reported 56% of the loops as having been vectorized, ACFL reported 54%, and Clang, 47%. We found that the vectorized code did not always outperform the unvectorized code. In some cases, given two very similar vectorizable loops, a compiler would vectorize one but not the other. We also report cases where a compiler vectorized a loop on only one of the two platforms. Based on our experiments, we cannot definitively say if any one compiler is significantly better than the others at vectorizing code on any given platform.
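To make the abstract's observation about "two very similar vectorizable loops" concrete, here is a minimal C sketch of the kind of loop pair where vectorization outcomes can differ. It is not taken from TSVC2 or from the paper: the function names `add_independent` and `prefix_dependent` are hypothetical, and the `restrict` qualifiers are an assumption that the arrays never overlap. The first loop has independent iterations, while the second carries a dependence from one iteration to the next, which typically blocks straightforward vectorization.

```c
#include <stddef.h>

/* Hypothetical example, not a TSVC2 loop.
 * Every iteration is independent: a[i] depends only on b[i] and c[i].
 * `restrict` is an assumption that a, b, and c do not alias, which
 * removes one common obstacle to auto-vectorization. */
void add_independent(float *restrict a, const float *restrict b,
                     const float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Hypothetical example, not a TSVC2 loop.
 * Loop-carried dependence: a[i] needs the value of a[i - 1] written by
 * the previous iteration, so the iterations cannot simply be executed
 * side by side in vector lanes. */
void prefix_dependent(float *restrict a, const float *restrict b, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
```

Whether a given compiler actually vectorizes either loop depends on the compiler version, target, and flags. One way to check is to ask the compiler for its own vectorization remarks, for example `-fopt-info-vec-optimized` and `-fopt-info-vec-missed` with GCC, or `-Rpass=loop-vectorize` and `-Rpass-missed=loop-vectorize` with Clang. This note does not claim that these are the exact flags used in the paper's experiments.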

Paper Structure

This paper contains 18 sections, 28 figures, and 2 tables.

Figures (28)

  • Figure 1: Loops vectorized by GCC, ICX, and Clang on x86
  • Figure 2: Geometric Mean of Execution Time
  • Figure 3: Execution time of loops vectorized by GCC only
  • Figure 4: Loop s281
  • Figure 5: Execution time of loops not vectorized by GCC only
  • ...and 23 more figures