Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors

Juan Carlos Saez, Fernando Castro, Manuel Prieto-Matias

TL;DR

The paper tackles performance portability of data-parallel OpenMP workloads on asymmetric multicore processors (AMPs) by introducing Asymmetric Iteration Distribution (AID), a set of loop-scheduling strategies implemented in libgomp. It provides three variants (AID-static, AID-hybrid, and AID-dynamic) that balance work across big and small cores without modifying applications, aided by a runtime estimate of the per-loop big-to-small speedup $SF$ obtained during an online sampling phase. A subtle GCC modification lets the runtime intervene in all parallel loops, making AID broadly applicable. Evaluation on ARM big.LITTLE and an x86 AMP emulator shows substantial gains over conventional scheduling: AID-static and AID-hybrid improve on static by up to 56%, and AID-dynamic improves on dynamic by up to 16.8% on the ARM platform. Overall, the work demonstrates a practical path to performance portability for unmodified data-parallel OpenMP programs on AMPs, reducing runtime overhead and improving resource utilization across heterogeneous cores.

Abstract

Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that asymmetric designs may deliver higher energy efficiency than symmetric multicores for diverse workloads. Despite their benefits, AMPs pose significant challenges to runtime systems of parallel programming models. While previous work has mainly explored how to efficiently execute task-based parallel applications on AMPs, via enhancements in the runtime system, improving the performance of unmodified data-parallel applications on these architectures is still a big challenge. In this work we analyze the particular case of loop-based OpenMP applications, which are widely used today in scientific and engineering domains, and constitute the dominant application type in many parallel benchmark suites used for performance evaluation on multicore systems. We observed that conventional loop-scheduling OpenMP approaches are unable to efficiently cope with the load imbalance that naturally stems from the different performance delivered by big and small cores. To address this shortcoming, we propose Asymmetric Iteration Distribution (AID), a set of novel loop-scheduling methods for AMPs that distribute iterations unevenly across worker threads to efficiently deal with performance asymmetry. We implemented AID in libgomp, the GNU OpenMP runtime system, and evaluated it on two different asymmetric multicore platforms. Our analysis reveals that the AID methods constitute effective replacements of the static and dynamic methods on AMPs, and are capable of improving performance over these conventional strategies by up to 56% and 16.8%, respectively.
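
To make the idea of distributing iterations unevenly more concrete, the following is a minimal Python sketch of a speedup-proportional split. It is not the libgomp implementation: the function name `asymmetric_split`, its parameters, and the rule that each big-core thread receives about `sf` times the share of a small-core thread are assumptions made for illustration, with `sf` standing in for a per-loop big-to-small speedup such as the one the runtime estimates.

```python
from fractions import Fraction
from math import floor


def asymmetric_split(n_iters, n_big, n_small, sf):
    """Return per-thread iteration counts, big-core threads first.

    Assumption (illustrative, not taken from the paper): each big-core
    thread should receive about `sf` times as many iterations as each
    small-core thread.
    """
    if n_iters < 0 or n_big < 0 or n_small < 0 or n_big + n_small == 0:
        raise ValueError("need a non-negative loop size and at least one thread")
    if sf <= 0:
        raise ValueError("sf must be positive")

    sf = Fraction(sf)                    # exact arithmetic, no float rounding
    total_weight = n_big * sf + n_small  # thread count in small-core units
    small_share = floor(n_iters / total_weight)
    big_share = floor(sf * n_iters / total_weight)

    counts = [big_share] * n_big + [small_share] * n_small

    # Flooring drops fewer than one iteration per thread, so one pass that
    # hands out a single extra iteration each (big-core threads first)
    # is enough to assign everything that is left.
    leftover = n_iters - sum(counts)
    for i in range(leftover):
        counts[i] += 1
    return counts


print(asymmetric_split(1000, 2, 2, 3))  # [375, 375, 125, 125]
print(asymmetric_split(10, 1, 1, 2))    # [7, 3]
```

With `sf = 1` the split reduces to an even, static-like distribution, which is the case where a conventional schedule would already be balanced.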

Paper Structure (9 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Execution traces obtained with the Paraver tool for the EP benchmark running with the static schedule and 4 threads on (a) two big and two small cores (2B-2S), and (b) four small cores (4S).
  • Figure 2: Big-to-small relative performance for the first 30 loops of applications BT and CG on Platforms A and B.
  • Figure 3: State diagram describing AID-static's behavior. The lower part of each state node indicates the number of iterations that will be removed from the iteration pool by a thread in each case.
  • Figure 4: Execution traces for the EP benchmark running with the AID-static and AID-hybrid (80%) schedules with 8 threads on Platform A.
  • Figure 5: State diagram describing AID-dynamic's behavior. Note that we performed an optimization that is not reflected in this diagram. The runtime system automatically switches to the dynamic($m$) schedule as soon as it detects that the number of iterations remaining to execute is no greater than $M \cdot (N_B + N_S)$. This greatly improves load balancing at the end of the loop.
  • ...and 4 more figures