The National Research Platform: Stretched, Multi-Tenant, Scientific Kubernetes Cluster

Derek Weitzel, Ashton Graves, Sam Albin, Huijun Zhu, Frank Würthwein, Mahidhar Tatineni, Dmitry Mishin, John Graham, Elham E Khoda, Mohammad Firas Sada, Larry Smarr, Thomas DeFanti

TL;DR

The National Research Platform (NRP) tackles the challenge of scalable, distributed, multi-tenant scientific computing by deploying a Kubernetes-based federation that unites heterogeneous compute and storage across 75+ sites. The paper presents the end-to-end infrastructure: automated Ansible deployment, IPMI and Kubernetes integration, NetBox inventory, Admiralty federation, Ceph storage, user-facing interfaces (JupyterHub, Coder), security monitoring (Falco), and accounting (ObservableHQ with caching). Together these support AI/ML workloads and hosted services such as LLMs. The paper reports substantial scale (1,400+ GPUs, 28,000 CPUs, and 161 TB of RAM across 420+ nodes), improved GPU utilization attributed to a reservation system, and growth in user communities, including educational institutions. These contributions offer a concrete blueprint for building and operating large-scale, policy-driven, multi-institution cyberinfrastructure that broadens access to advanced scientific computing and supports workforce development.

Abstract

The National Research Platform (NRP) is a distributed, multi-tenant, Kubernetes-based cyberinfrastructure designed to facilitate collaborative scientific computing. Spanning over 75 locations in the U.S. and internationally, the NRP integrates varied computational resources, ranging from single nodes to extensive GPU and CPU clusters, to support diverse research workloads, including advanced AI and machine learning tasks. It emphasizes flexibility, offering both user-friendly interfaces such as JupyterHub and low-level control of resources through direct Kubernetes interaction. Critical operational insights are discussed, including security enhancements using Kubernetes-integrated threat detection, extensive monitoring, and comprehensive accounting systems. This paper highlights the NRP's growing importance and scalability in addressing the increasing demand for distributed scientific computing resources.

Paper Structure

This paper contains 8 sections, 1 figure.

Figures (1)

  • Figure 1: GPU hours by research group on the NRP over time, showing growth in GPU use. For the 12-month period, the total is 6,605,042 GPU hours from 412 research groups.