Monitoring the development of CFD applications on unstable HPC platforms
Damien Dosimont, Guillaume Houzeaux
TL;DR
This paper tackles the challenge of monitoring CFD applications on unstable HPC platforms by integrating performance monitoring into a CI-CD pipeline and providing a history-aware visual analytics interface. The proposed alya-cicd framework automates building, executing, and collecting rich execution metadata, while Rooster enables interactive, hierarchical analysis and cross-commit comparisons. Case studies on Alya over two years reveal platform-induced issues such as GPFS file-system pressure and suboptimal MPI-IO configurations, alongside measurable improvements from code optimizations like vectorization. The work demonstrates a practical, generalizable approach to stabilizing HPC CFD performance and offers a pathway to more reliable, scalable development on production platforms.
Abstract
We tackle the challenging task of monitoring the performance of CFD applications throughout their development on unstable HPC platforms. We have designed and implemented a monitoring framework that is integrated at the end of a CI-CD pipeline. Measurements collected during the automatic execution of production simulations are analyzed in a visual analytics interface that we developed, which provides advanced visualizations and interactions. We have validated this approach by monitoring the CFD code Alya over two years, detecting and resolving platform-related issues and highlighting performance improvements.
