Monitoring the development of CFD applications on unstable HPC platforms

Damien Dosimont, Guillaume Houzeaux

TL;DR

This paper tackles the challenge of monitoring CFD applications on unstable HPC platforms by integrating performance monitoring into a CI-CD pipeline and providing a history-aware visual analytics interface. The proposed alya-cicd framework automates building, executing, and collecting rich execution metadata, while Rooster enables interactive, hierarchical analysis and cross-commit comparisons. Case studies on Alya over two years reveal platform-induced issues such as GPFS file-system pressure and suboptimal MPI-IO configurations, alongside measurable improvements from code optimizations like vectorization. The work demonstrates a practical, generalizable approach to stabilizing HPC CFD performance and offers a pathway to more reliable, scalable development on production platforms.

Abstract

We tackle the challenging task of monitoring the performance of CFD applications throughout their development on unstable HPC platforms. We have designed and implemented a monitoring framework, integrated at the end of a CI-CD pipeline. Measurements retrieved during the automatic execution of production simulations are analyzed within a visual analytics interface that we developed, which provides advanced visualizations and interaction. We have validated this approach by monitoring the CFD code Alya over two years, detecting and resolving issues related to the platform, and highlighting performance improvements.
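To make the idea of history-aware monitoring concrete, the following is a minimal sketch of how execution times collected per commit could be compared against recent history. It is not the alya-cicd or Rooster implementation: the function name `flag_slow_runs`, the rolling-median baseline, the `window` and `tolerance` parameters, and the sample data are all assumptions made for illustration.

```python
from statistics import median


def flag_slow_runs(runs, window=5, tolerance=1.5):
    """Flag runs that are much slower than the recent history.

    runs: list of (commit_id, seconds) tuples in chronological order.
    A run is flagged when its time exceeds `tolerance` times the median
    of up to `window` preceding runs. Both thresholds are hypothetical.
    """
    flagged = []
    for i, (commit, seconds) in enumerate(runs):
        # Times of the preceding runs, limited to the rolling window.
        history = [t for _, t in runs[max(0, i - window):i]]
        if len(history) < 2:
            continue  # not enough history to form a baseline
        baseline = median(history)
        if seconds > tolerance * baseline:
            flagged.append((commit, seconds, baseline))
    return flagged


# Hypothetical timings (seconds) for six consecutive commits.
runs = [
    ("a1", 100.0), ("b2", 102.0), ("c3", 99.0),
    ("d4", 101.0), ("e5", 180.0), ("f6", 100.0),
]

for commit, seconds, baseline in flag_slow_runs(runs):
    print(f"{commit}: {seconds:.1f}s vs. baseline {baseline:.1f}s")
```

Under these assumptions, an isolated flag followed by a return to the baseline would suggest a transient platform effect, whereas flags that persist over consecutive commits would be more consistent with a change in the code itself.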

Paper Structure (7 sections, 3 figures)

This paper contains 7 sections and 3 figures.

Figures (3)

  • Figure 1: Stacked bar chart showing the time spent in the different operation types. The 6th simulation iteration is affected by longer I/O
  • Figure 2: Simulation time of CambSprayH1S1, decreasing after November 2021
  • Figure 3: Sunbursts of the cough simulation showing the execution time improvement thanks to the vectorization of the velocity correction