cfdSCOPE: A Fluid-Dynamics Proxy App for Teaching Performance Engineering

Peter Arzt, Sebastian Kreutzer, Tim Jammer, Christian Bischof

TL;DR

cfdSCOPE is an accessible HPC teaching tool: a compact, open-source proxy app that models incompressible flow using the lid-driven cavity benchmark. It is implemented as a self-contained C++ codebase that uses a staggered-grid discretization and a preconditioned conjugate gradient (PCG) solver on a matrix in compressed sparse row (CSR) format, with deliberately simple OpenMP parallelization that leaves optimization opportunities exposed. The authors demonstrate substantial runtime improvements through a structured optimization workflow, reporting a 76% speedup along with detailed strong-scaling and roofline analyses. The work delivers a practical platform for course-based learning in performance engineering and presents concrete, transferable optimization techniques such as memory reuse, simplified preconditioning, in-place computation, and improved memory access. A future extension to distributed-memory architectures is possible.

Abstract

Teaching performance engineering in high-performance computing (HPC) requires example codes that demonstrate bottlenecks and enable hands-on optimization. However, existing HPC applications and proxy apps often lack the balance of simplicity, transparency, and optimization potential needed for effective teaching. To address this, we developed cfdSCOPE, a compact, open-source computational fluid dynamics (CFD) proxy app specifically designed for educational purposes. cfdSCOPE simulates flow in a 3D volume using sparse linear algebra, a common HPC workload, and comprises fewer than 1,100 lines of code. Its minimal dependencies and transparent design ensure students can fully control and optimize performance-critical aspects, while its naive OpenMP parallelization provides significant optimization opportunities, thus making it an ideal tool for teaching performance engineering.

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Visualization of the velocity field at different time steps, sliced along the x-y plane. The top row of figures displays the direction of flow at each point while the bottom row shows streamline plots.
  • Figure 2: Strong scaling plot for the unoptimized and optimized versions of cfdSCOPE, depicting the runtime of the simulation and of a selection of its functions for different numbers of OpenMP threads. Both axes are logarithmic.
  • Figure 3: Roofline model for selected kernels. The sloped red and horizontal blue lines depict the achievable performance as a function of a kernel's arithmetic intensity, limited by the memory bandwidth and the maximum compute bandwidth $R_{peak}$, respectively. Limits have been measured using 32 CPU cores. Both axes are logarithmic. For each kernel, the horizontal dash marks the performance of the unoptimized version, while the connected marker represents the optimized version. The entry with the star marker shows the performance of vector addition/multiplication (operator+, operator*) in the unoptimized version and of multiply_add_inplace in the optimized version.
  • Figure 4: Comparison of traces before and after optimization. Shown is a single time step computed with 8 threads. Note that the time scales differ between the two traces.
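
The Figure 3 caption contrasts vector addition/multiplication via operator+ and operator* with a fused multiply_add_inplace kernel. The following is a minimal C++ sketch of that general idea, not the cfdSCOPE implementation: the functions `scaled`, `added`, and `multiply_add_inplace_sketch` and their signatures are hypothetical names chosen for illustration, and plain `std::vector<double>` stands in for whatever vector type the proxy app actually uses. The out-of-place helpers each allocate and fill a temporary vector, whereas the fused version updates `y` in a single pass without allocating.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Out-of-place scaling: returns a newly allocated vector a * x.
// (Hypothetical stand-in for an operator*-style vector operation.)
std::vector<double> scaled(const std::vector<double>& x, double a) {
    std::vector<double> result(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        result[i] = a * x[i];
    }
    return result;
}

// Out-of-place addition: returns a newly allocated vector y + x.
// (Hypothetical stand-in for an operator+-style vector operation.)
std::vector<double> added(const std::vector<double>& y,
                          const std::vector<double>& x) {
    assert(y.size() == x.size());
    std::vector<double> result(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        result[i] = y[i] + x[i];
    }
    return result;
}

// Fused in-place update: y[i] += a * x[i].
// One pass over the data and no temporary vectors. The OpenMP pragma is
// ignored when the code is compiled without OpenMP support.
void multiply_add_inplace_sketch(std::vector<double>& y, double a,
                                 const std::vector<double>& x) {
    assert(y.size() == x.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(y.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Under these assumptions, `added(y, scaled(x, a))` and `multiply_add_inplace_sketch(y, a, x)` compute the same values. The fused form avoids two temporary allocations and the extra memory traffic of writing and re-reading the intermediate result, which is the kind of saving a memory-bound kernel can benefit from.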