Energy efficiency optimization of task-parallel codes on asymmetric architectures

Luis Costero, Francisco D. Igual, Katzalin Olcoz, Francisco Tirado

TL;DR

This work tackles energy efficiency for task-parallel codes on asymmetric ARM big.LITTLE systems using a set of runtime-driven policies integrated in Nanox. It introduces two policy families, FS (DVFS-based) and TS (scheduling-based), which modulate cluster frequencies or cluster usage according to the scheduler state, aiming for energy savings with minimal performance impact. On an Exynos 5422 platform with a Cholesky factorization workload, FS3 achieves up to a 29.3% improvement in energy efficiency, while frequency scaling of the LITTLE cluster reduces power but does not always improve energy efficiency. TS policies offer limited gains, with TS3 reaching up to 17.1% in selected configurations. The findings indicate that scaling the big cluster frequency provides the strongest energy-efficiency benefits, and they motivate further work on broader benchmarks and automatic policy selection.

Abstract

We present a family of policies that, integrated within a runtime task scheduler (Nanox), pursue the goal of improving the energy efficiency of task-parallel executions with no intervention from the programmer. The proposed policies tackle the problem by modifying the core operating frequency via DVFS mechanisms, or by enabling/disabling the mapping of tasks to specific cores at selected execution points, depending on the internal status of the scheduler. Experimental results on an asymmetric SoC (Exynos 5422) and for a specific operation (Cholesky factorization) reveal gains up to 29% in terms of energy efficiency and considerable reductions in average power.

Paper Structure (17 sections, 9 figures, 5 tables)

This paper contains 17 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: DAG with tasks and data dependences extracted from the application of the code in Listing \ref{lst:chol} to a matrix with $4 \times 4$ blocks ($s=4$). The labels in the nodes specify the type of kernel/task as follows: "C" for the Cholesky factorization; "T" for the triangular system solve; "G" for the matrix-matrix multiplication; and "S" for the symmetric rank-$b$ update. The subindices (starting at 0) specify the submatrix updated by the corresponding task.
  • Figure 2: Samsung Exynos 5422 SoC employed in our experiments.
  • Figure 3: Behavior of each FS policy when applied to a Cholesky factorization of a $1024\times 1024$ matrix divided into blocks of $64\times 64$ elements.
  • Figure 4: Policy TS2: task scheduling based on the number of ready tasks, for a Cholesky factorization of a square matrix of $4096\times4096$ elements, partitioned into square blocks of $512\times512$ elements each, executed on an Odroid platform. Color key: red = trsm, pink = potrf, blue = syrk, green = gemm, white = idle.
  • Figure 5: Power consumption of each cluster in the idle state with different numbers of active cores. The Linux kernel does not allow switching off the whole LITTLE cluster, so measurements could not be made for this scenario.
  • ...and 4 more figures
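
To make the task structure described in the caption of Figure 1 concrete, the following minimal sketch enumerates the tasks of a blocked Cholesky factorization on an $s \times s$ grid of blocks. It assumes the usual right-looking variant operating on the lower triangle; the paper's own listing is not reproduced here, so the loop order, the function name `cholesky_tasks`, and the tuple layout are illustrative assumptions rather than the authors' code. The single-letter labels follow the caption ("C", "T", "G", "S").

```python
from collections import Counter


def cholesky_tasks(s):
    """Enumerate tasks of a right-looking blocked Cholesky factorization.

    Hypothetical helper for illustration only. Each task is a tuple
    (kind, i, j), where (i, j) is the block the task updates.
    """
    tasks = []
    for k in range(s):
        # "C": Cholesky factorization of the diagonal block (k, k).
        tasks.append(("C", k, k))
        # "T": triangular solves for the blocks below the diagonal in column k.
        for i in range(k + 1, s):
            tasks.append(("T", i, k))
        # Trailing update of the remaining lower-triangular blocks.
        for i in range(k + 1, s):
            # "G": matrix-matrix multiplications updating off-diagonal blocks.
            for j in range(k + 1, i):
                tasks.append(("G", i, j))
            # "S": symmetric rank-b update of the diagonal block (i, i).
            tasks.append(("S", i, i))
    return tasks


if __name__ == "__main__":
    tasks = cholesky_tasks(4)
    counts = Counter(kind for kind, _, _ in tasks)
    for kind in "CTGS":
        print(kind, counts[kind])
    print("total", len(tasks))
```

Under these assumptions, the case $s=4$ yields 20 tasks in total. A runtime such as Nanox would derive the dependences between them from the blocks each task reads and writes; this sketch only lists the tasks and does not build the DAG edges.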