Enabling Message Passing Interface Containers on the LUMI Supercomputer

Alfio Lazzaro

TL;DR

The paper addresses how to run MPI-based applications from container images on the LUMI supercomputer. It proposes a hybrid model: applications are built inside the container against a container-provided MPI library (e.g., MPICH or OpenMPI), and at run time execution binds to the vendor-optimized HPE Cray MPI on the host. Because user namespaces are disabled, the containers are built rootless with proot. The paper provides concrete container definition files, discusses MPI ABI translation with MPIxlate, and validates correctness and performance with MPI tests and the OSU benchmarks. The result is portable, reproducible deployment of high-performance MPI workloads on LUMI; the containers can serve as base images, and similar concepts apply to other HPC environments.

Abstract

Containers represent a convenient way of packing applications with dependencies for easy user-level installation and productivity. When running on supercomputers, it becomes crucial to optimize the containers to exploit the performance optimizations provided by the system vendors. In this paper, we discuss an approach we have developed for deploying containerized applications on the LUMI supercomputer, specifically for running applications based on Message Passing Interface (MPI) parallelization. We show how users can build and run containers and get the expected performance. The proposed MPI containers can be provided on LUMI so that users can use them as base images. Although we only refer to the LUMI supercomputer, similar concepts can be applied to the case of other supercomputers.

Paper Structure (6 sections, 6 figures)

This paper contains 6 sections, 6 figures.

Figures (6)

  • Figure 1: An example of how to access the proot command (version 5.4.0) on LUMI via Lmod modules, provided by LUST. The command itself is provided by the systools module, which is in this example part of the software hierarchy provided by the modules LUMI/23.09 and partition/C.
  • Figure 2: Minimal example of a Singularity definition file to build MPICH compatible with the HPE Cray MPI available on LUMI. We use the Ubuntu 24.04 base image from Docker Hub. Note that we specifically disable the creation of static libraries and the rpath of the shared libraries. Furthermore, we use the CH3 implementation, which does not require installing any network interface library, e.g. the OFI Libfabric library. This allows binding to the HPE Cray MPI when running the container.
  • Figure 3: Minimal example of a Singularity definition file with OpenMPI (version 4.1.6), based on the Ubuntu 24.04 base image from Docker Hub. The openssl package, a dependency of libopenmpi-dev, requires running some privileged commands (groupadd, addgroup, chgrp), for which proot's emulation of the root user does not work. As a workaround, we replace them with symbolic links to the true command, so that they always return without errors, allowing the container to build successfully. A more systematic approach to faking privileged commands has been proposed in Charliecloud via the kernel's seccomp(2) system call filtering [charliecloud_seccomp].
  • Figure 4: MPI test example (mpitest.c) used to check that MPI is running properly. A correct output reports the correct number of MPI ranks and the version of the MPI implementation library.
  • Figure 5: MPI_Init intercept call (C and Fortran APIs) to be used via LD_PRELOAD. By default, the test is disabled. It can be enabled by setting the environment variable CHECK_MPI=1 (or 2 for a verbose mode). Then, the Slurm environment variable SLURM_NTASKS is compared to the number of running MPI ranks and the execution aborts if the two values differ. The test can be part of the base MPI container.
  • ...and 1 more figure
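
The check described in the Figure 5 caption can be illustrated with a small sketch of its decision logic only. This is not the paper's code: the function name check_mpi_ranks, the check_result enum, and the handling of a missing or malformed SLURM_NTASKS value are assumptions made for illustration. The MPI_Init interception itself is left out so that the sketch compiles without an MPI installation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical result codes; the paper's interceptor simply aborts on mismatch. */
enum check_result { CHECK_SKIPPED, CHECK_OK, CHECK_MISMATCH, CHECK_NO_SLURM };

/*
 * Decide whether the number of running MPI ranks matches what Slurm requested.
 *   check_mpi    : value of the CHECK_MPI environment variable (may be NULL)
 *   slurm_ntasks : value of the SLURM_NTASKS environment variable (may be NULL)
 *   world_size   : number of MPI ranks, e.g. as returned by MPI_Comm_size
 */
enum check_result check_mpi_ranks(const char *check_mpi,
                                  const char *slurm_ntasks,
                                  int world_size)
{
    assert(world_size > 0); /* an MPI job always has at least one rank */

    /* The test is disabled by default: only CHECK_MPI=1 or 2 enables it. */
    if (check_mpi == NULL)
        return CHECK_SKIPPED;
    int level = atoi(check_mpi);
    if (level != 1 && level != 2)
        return CHECK_SKIPPED;

    /* Assumption: a missing or non-numeric SLURM_NTASKS is reported separately. */
    if (slurm_ntasks == NULL)
        return CHECK_NO_SLURM;
    char *end = NULL;
    long ntasks = strtol(slurm_ntasks, &end, 10);
    if (end == slurm_ntasks || *end != '\0')
        return CHECK_NO_SLURM;

    /* CHECK_MPI=2 is the verbose mode. */
    if (level == 2)
        fprintf(stderr, "CHECK_MPI: SLURM_NTASKS=%ld, MPI ranks=%d\n",
                ntasks, world_size);

    return (ntasks == (long)world_size) ? CHECK_OK : CHECK_MISMATCH;
}
```

In an interceptor of the kind the caption describes, a wrapper around MPI_Init would presumably call the real initialization first, obtain the rank count with MPI_Comm_size, pass getenv("CHECK_MPI") and getenv("SLURM_NTASKS") to a function like this one, and call MPI_Abort when the result is CHECK_MISMATCH. A typical symptom the check catches is every process reporting a single rank, which suggests that the container's MPI did not bind to the host MPI.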