Software Resource Disaggregation for HPC with Serverless Computing

Marcin Copik, Marcin Chrapek, Larissa Schmid, Alexandru Calotoiu, Torsten Hoefler

TL;DR

The paper addresses the persistent resource underutilization in HPC systems by introducing software resource disaggregation through HPC-oriented Function-as-a-Service (FaaS). By adapting a high-performance serverless runtime (rFaaS) to Cray-based HPC environments, it enables fine-grained offloading of CPU, memory, and GPU resources from idle or partially allocated nodes, while maintaining near-native performance. Key contributions include a comprehensive HPC-specific FaaS design, co-location policies and memory/GPU sharing mechanisms, and integration with HPC runtimes like MPI/OpenMP. Case studies on Ault and Daint demonstrate substantial throughput gains (up to $53\%$) and reliable remote memory access up to $1\,\text{GB/s}$, illustrating a practical path to boost utilization without hardware changes.

Abstract

Aggregated HPC resources have rigid allocation systems and programming models which struggle to adapt to diverse and changing workloads. Consequently, HPC systems fail to efficiently use the large pools of unused memory and increase the utilization of idle computing resources. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fall short of providing a solution that can be applied to existing systems without major hardware modifications and performance losses. In this paper, we improve the utilization of supercomputers by employing the new cloud paradigm of serverless computing. We show how serverless functions provide fine-grained access to the resources of batch-managed cluster nodes. We present an HPC-oriented Function-as-a-Service (FaaS) that satisfies the requirements of high-performance applications. We demonstrate a software resource disaggregation approach where placing functions on unallocated and underutilized nodes allows idle cores and accelerators to be utilized while retaining near-native performance.

Paper Structure

This paper contains 31 sections, 1 equation, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Piz Daint utilization in March 2022: querying SLURM with a two-minute interval. See Sec. \ref{sec:background_utilization_hpc} for details.
  • Figure 2: Software disaggregation with FaaS: increasing resource utilization without modifications to HPC hardware.
  • Figure 3: Software disaggregation: co-location provides semantics of resource disaggregation on an unmodified system.
  • Figure 4: Co-location policies use lightweight online monitoring.
  • Figure 5: Specializing serverless platform for HPC requirements.
  • ...and 12 more figures