An Incremental Multi-Level, Multi-Scale Approach to Assessment of Multifidelity HPC Systems
Shilpika Shilpika, Bethany Lusch, Venkatram Vishwanath, Michael E. Papka
TL;DR
The paper addresses real-time analysis of terabyte-scale, multifidelity HPC logs by introducing an online, incremental form of multiresolution dynamic mode decomposition (I-mrDMD). It combines a streaming mrDMD framework with frequency-aware spectral isolation and a D3-based rack visualization in Jupyter to align environment, hardware, and job logs across multiple scales. The key contributions are incremental SVD-based updates for multi-level mrDMD, a spectral mechanism for isolating significant modes, and an interactive visualization pipeline, demonstrated on the Theta supercomputer through two case studies. The approach targets fast, scalable pattern discovery and anomaly detection in large-scale HPC systems, with practical implications for resilience, debugging, and efficient resource use, and is described as extensible to exascale and zettascale environments.
Abstract
As large-scale computing systems grow in size and architectural complexity, monitoring and analyzing system behavior and events has become daunting. Sensors housed in these systems collect terabytes of monitoring data per day at multiple fidelity levels and varying temporal resolutions. In this work, we develop an incremental version of multiresolution dynamic mode decomposition (mrDMD), which converts high-dimensional data into spatial-temporal patterns at varied frequency ranges. Our incremental implementation of the mrDMD algorithm (I-mrDMD) promptly reveals valuable information in the massive environment log dataset, which is then visually aligned with the processed hardware and job log datasets through our generalizable rack visualization, built with D3 and integrated into the Jupyter Notebook interface. We demonstrate the efficacy of our approach with two use scenarios on a real-world dataset from a Cray XC40 supercomputer, Theta.
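To make "spatial-temporal patterns at varied frequency ranges" concrete, the following is a minimal sketch of plain (batch) dynamic mode decomposition, the building block that mrDMD applies recursively across time scales. It is not the authors' I-mrDMD: there is no incremental SVD update and no multi-level recursion. The function name `dmd`, the synthetic 8-sensor signal, and the chosen frequencies (0.5 Hz and 4 Hz) are all illustrative assumptions.

```python
import numpy as np

def dmd(X, dt, rank):
    """Exact DMD of a snapshot matrix X (rows: sensors, columns: time steps).

    Hypothetical helper for illustration; returns spatial modes,
    eigenvalues, and the oscillation frequency (Hz) of each mode.
    """
    X1, X2 = X[:, :-1], X[:, 1:]                 # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank, :]
    # Reduced operator A_tilde = U^* X2 V S^{-1}; dividing by s scales columns.
    A_tilde = U.conj().T @ X2 @ Vh.conj().T / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (X2 @ Vh.conj().T / s) @ W           # spatial patterns, one per column
    freqs = np.abs(np.angle(eigvals)) / (2 * np.pi * dt)
    return modes, eigvals, freqs

# Synthetic stand-in for sensor readings: 8 sensors, a slow and a fast oscillation.
# Each frequency needs both a cosine and a sine component so that the
# snapshots evolve linearly from one time step to the next.
dt = 0.01
t = np.arange(200) * dt
rng = np.random.default_rng(0)
patterns = rng.standard_normal((8, 4))           # assumed spatial patterns
dynamics = np.vstack([
    np.cos(2 * np.pi * 0.5 * t), np.sin(2 * np.pi * 0.5 * t),
    np.cos(2 * np.pi * 4.0 * t), np.sin(2 * np.pi * 4.0 * t),
])
X = patterns @ dynamics

modes, eigvals, freqs = dmd(X, dt, rank=4)
print(np.sort(freqs))
```

Each recovered frequency appears twice because real oscillations yield complex-conjugate eigenvalue pairs. A multiresolution variant would keep only the slowest modes at each level, subtract their reconstruction, and repeat on shorter time windows.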
