Introducing JIRIAF: A Virtual Kubelet Integration for Optimizing HPC Resource Provisioning
Vardan Gyurjyan, Graham Heyes, Christopher Larrieu, David Lawrence, Jeng-Yuan Tsai
TL;DR
The paper presents JIRIAF, a framework for optimizing HPC resource provisioning across heterogeneous facilities by integrating Kubernetes with a Virtual Kubelet-based JIRIAF Resource Manager (JRM) that runs in userspace. It details the VK-Cmd implementation, pod lifecycle management, and Horizontal Pod Autoscaler (HPA) support, and demonstrates a proof-of-concept deployment on NERSC’s Perlmutter system running data-stream pipelines for CLAS12 ERSAP workloads. The work also introduces a digital twin, based on a Dynamic Bayesian Network, that models and controls a simulated queue for real-time monitoring, and it covers Prometheus-based monitoring and FireWorks-driven JRM deployment. Together, these components illustrate a scalable, container-centric approach to distributed HPC resource management, with practical deployment guidance and a performance evaluation on real-world workloads.
Abstract
The JIRIAF (JLab Integrated Research Infrastructure Across Facilities) framework is designed to streamline resource management and optimize high-performance computing (HPC) workloads across heterogeneous environments. Central to JIRIAF is the JIRIAF Resource Manager (JRM), which uses Kubernetes and Virtual Kubelet to manage resources dynamically, even in environments with restricted user privileges. Because it operates in userspace, JRM can run user applications as containers across diverse computing sites under unified control and monitoring. We demonstrate the framework through a case study that deploys data-stream processing pipelines on the Perlmutter system at NERSC, showing that it can manage large-scale HPC applications efficiently. We also describe a digital twin of a simulated queue in a streaming system, built on a Dynamic Bayesian Network (DBN), that supports real-time monitoring and control and offers insight into system performance and optimization strategies.
