Reproducible Cross-border High Performance Computing for Scientific Portals

Kessy Abarenkov, Anne Fouilloux, Helmut Neukirchen, Abdulrahman Azab

TL;DR

The paper tackles reproducibility challenges in cross-border eScience by integrating containerized software packaging with web portals (Galaxy and PlutoF) to enable automated workflows across heterogeneous HPC resources. It demonstrates two pilots, Biodiversity and Climate, detailing technical solutions (front-ends, packaging with Singularity/conda wrappers, and setup automation) and governance issues (robot accounts, authentication, quotas). The work contributes a generic, portal-agnostic approach to coupling community portals with remote compute and data resources, enhanced by automated setup and reproducible environments to achieve bit-for-bit reproducibility when feasible. Its significance lies in enabling scalable, cross-border scientific computations within EOSC-Nordic, improving FAIR data handling, and guiding policy and sustainability considerations for long-term cross-border HPC access.

Abstract

To reproduce eScience, several challenges need to be solved: scientific workflows need to be automated; the software versions involved need to be specified unambiguously; input data needs to be easily accessible; and, because High-Performance Computing (HPC) clusters are often involved, achieving bit-to-bit reproducibility may even require executing the code on a particular cluster to avoid differences caused by different HPC platforms (unless this is the scientist's local cluster, it needs to be accessed across (administrative) borders). Preferably, everything should be user-friendly so that even inexperienced users can (re-)produce results. While some easy-to-use web-based scientific portals already support access to HPC resources, this typically covers only local computing and data resources. Using the example of two community-specific portals in the fields of biodiversity and climate research, we present a solution for accessing remote HPC (and cloud) compute and data resources from scientific portals across borders, involving rigorous container-based packaging of the software version and setup automation, thus enhancing reproducibility.

Paper Structure (20 sections, 1 figure, 1 table)

This paper contains 20 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overview of the components of Galaxy Climate. (i) Left-hand side: the Galaxy Climate front-end. (ii) Center: remote compute resources, which can be added (new Pulsar node) and are selected depending on their availability and on tool requirements such as GPUs or memory. (iii) Right-hand side: Object Storage end-points for accessing data remotely, independently of its physical location.
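
The selection step described in the Figure 1 caption (choosing a compute resource by availability and tool requirements) can be illustrated with a minimal sketch. This is not Galaxy's or Pulsar's actual configuration or API: the `Destination` and `ToolRequirements` classes, the `select_destination` function, the node names, and the "tightest fit" preference are all hypothetical, introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Destination:
    """A hypothetical compute resource (e.g. one remote node)."""
    name: str
    available: bool
    gpus: int
    memory_gb: int


@dataclass(frozen=True)
class ToolRequirements:
    """Hypothetical resource needs declared by a tool."""
    gpus: int = 0
    memory_gb: int = 1


def select_destination(
    destinations: Sequence[Destination], req: ToolRequirements
) -> Optional[Destination]:
    """Return the smallest available destination that satisfies req, or None."""
    candidates = [
        d for d in destinations
        if d.available and d.gpus >= req.gpus and d.memory_gb >= req.memory_gb
    ]
    # Assumed policy: prefer the tightest fit so larger nodes stay free.
    return min(candidates, key=lambda d: (d.gpus, d.memory_gb), default=None)


# Illustrative inventory; names and sizes are made up.
destinations = [
    Destination("local", available=True, gpus=0, memory_gb=16),
    Destination("remote-cpu", available=True, gpus=0, memory_gb=128),
    Destination("remote-gpu", available=True, gpus=4, memory_gb=256),
    Destination("remote-gpu-down", available=False, gpus=8, memory_gb=512),
]

chosen = select_destination(destinations, ToolRequirements(gpus=1, memory_gb=64))
print(chosen.name if chosen else "no destination available")
```

In this sketch a tool that needs one GPU and 64 GB of memory is routed to the only available GPU node, while a tool whose requirements no available node meets gets `None`, which a portal could surface as a queued or rejected job.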