HPC Containers for EBRAINS: Towards Portable Cross-Domain Software Environment

Krishna Kant Singh, Eric Müller, Eleni Mathioulaki, Wouter Klijn, Lena Oden

Abstract

Deploying complex, distributed scientific workflows across diverse HPC sites is often hindered by site-specific dependencies and complex build environments. This paper investigates the design and performance of portable HPC container images that encapsulate MPI- and CUDA-enabled software stacks without sacrificing bare-metal performance. The work was carried out within the EBRAINS Research Infrastructure to evaluate portable, Apptainer-based HPC container images for the EBRAINS Software Distribution (ESD), a Spack-based software ecosystem comprising approximately 80 top-level packages and 800 dependencies. We evaluate a hybrid, PMIx-based containerization strategy using Apptainer that avoids site-specific builds by dynamically using host-level specialized hardware, such as network interfaces and GPUs, on two production HPC clusters: Karolina and JURECA-DC. We demonstrate the feasibility of building portable, MPI- and CUDA-enabled scientific software into container images that correctly use site-installed drivers and hardware to reproduce bare-metal communication behavior. Using communication microbenchmarks (e.g., OSU and NCCL) alongside performance metrics of neuroscience applications, we measure the containers' performance and verify it against bare-metal deployments. Crucially, our verification approach extends beyond top-level runtime measurements: we analyze the underlying debug logs to detect misbehavior and misconfigurations, such as suboptimal transport pathways. Ultimately, this investigation demonstrates the feasibility of a simple and reproducible methodology for decoupling software environments from the underlying infrastructure, paving the way for automated pipelines that ensure optimized, performance-verified execution across varied HPC architectures.

Paper Structure

This paper contains 40 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: osu_init benchmark results comparing MPI initialization time for native and Apptainer container execution on Karolina (a) and JURECA (b). Error bars represent the minimum and maximum observed initialization times across all runs. On Karolina, the container incurs a consistently higher initialization overhead than native execution, with the gap widening at 256 nodes. On JURECA, the container exhibits significantly lower initialization times across all node counts, suggesting a leaner MPI bootstrap path within the containerized environment. Lower is better.
  • Figure 2: OSU point-to-point latency (osu_latency) for intra-node communication (1 node, 2 MPI tasks on the same node) on (a) Karolina and (b) JURECA. Latency is plotted as a function of message size for Apptainer (container) and native bare-metal execution. Both axes are logarithmic. Lower is better.
  • Figure 3: OSU point-to-point latency (osu_latency) for inter-node communication (2 nodes, 1 MPI task per node) on (a) Karolina and (b) JURECA. Latency is plotted as a function of message size for Apptainer (container) and native bare-metal execution. Both axes are logarithmic. Lower is better.
  • Figure 4: NCCL AllReduce bus bandwidth as a function of message size for a single-node configuration, comparing native execution and Apptainer containerisation. (a) Karolina: 8 GPUs interconnected via NV12 NVLink bonds, peak bus bandwidth $\approx$ 225 GB/s. (b) JURECA: 4 GPUs interconnected via NV4 NVLink bonds, peak bus bandwidth $\approx$ 225 GB/s. On both systems, native and Apptainer results are indistinguishable across all message sizes, with peak bus bandwidth deviating by at most 1.3%. Markers show the mean of two benchmark runs; error bars indicate the half-range between runs. Higher is better.
  • Figure 5: NCCL AllReduce bus bandwidth as a function of message size for a two-node configuration, comparing native execution and Apptainer containerisation. (a) Karolina: peak inter-node bus bandwidth 92.5 GB/s, sustained by 4 InfiniBand NICs per node, each connected to a dedicated GPU pair. (b) JURECA: peak inter-node bus bandwidth 49.0 GB/s, limited by only 2 NICs per node. The $\approx 2\times$ bandwidth difference between the systems reflects their NIC-to-GPU topology rather than any container effect: native and Apptainer results agree to within 0.09% on Karolina and 0.01% on JURECA across all message sizes. Higher is better.
  • ...and 6 more figures
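
The NCCL captions above summarise each data point as the mean of two runs with half-range error bars, and quote percent deviations between native and container results. The following is a minimal Python sketch of that arithmetic, not the authors' analysis code: the function names and all sample values are hypothetical. The AllReduce conversion from algorithm bandwidth to bus bandwidth, factor 2(n-1)/n, follows the convention documented for the nccl-tests suite; the sketch assumes the reported figures use it.

```python
from statistics import mean


def mean_and_half_range(samples):
    """Return (mean, half-range) for repeated measurements of one data point."""
    if not samples:
        raise ValueError("need at least one sample")
    return mean(samples), (max(samples) - min(samples)) / 2.0


def relative_deviation_percent(native, container):
    """Percent deviation of the container result relative to the native result."""
    if native == 0:
        raise ValueError("native reference must be non-zero")
    return 100.0 * abs(container - native) / native


def allreduce_bus_bandwidth(algorithm_bandwidth, n_ranks):
    """Convert AllReduce algorithm bandwidth to bus bandwidth.

    Uses the nccl-tests convention: busbw = algbw * 2 * (n - 1) / n.
    """
    if n_ranks < 2:
        raise ValueError("AllReduce bus bandwidth needs at least two ranks")
    return algorithm_bandwidth * 2.0 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Hypothetical bus-bandwidth samples in GB/s, NOT values from the paper.
    native_runs = [224.0, 226.0]
    container_runs = [223.0, 225.0]

    native_mean, native_err = mean_and_half_range(native_runs)
    container_mean, container_err = mean_and_half_range(container_runs)
    deviation = relative_deviation_percent(native_mean, container_mean)

    print(f"native:    {native_mean:.1f} +/- {native_err:.1f} GB/s")
    print(f"container: {container_mean:.1f} +/- {container_err:.1f} GB/s")
    print(f"deviation: {deviation:.2f} %")
```

Dividing by the native value makes bare metal the reference, so a deviation of 1% means the container result differs from the native one by one hundredth of the native bandwidth.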