
Scalable ATLAS pMSSM computational workflows using containerised REANA reusable analysis platform

Marco Donadoni, Matthew Feickert, Lukas Heinrich, Yang Liu, Audrius Mečionis, Vladyslav Moisieienkov, Tibor Šimko, Giordon Stark, Marco Vidal García

TL;DR

A streamlined framework is described for large-scale ATLAS pMSSM reinterpretations of LHC Run-2 analyses, built on containerised computational workflows running on the REANA reusable analysis platform.

Abstract

In this paper we describe the development of a streamlined framework for large-scale ATLAS pMSSM reinterpretations of LHC Run-2 analyses using containerised computational workflows. The project aims to assess the global coverage of BSM physics and requires running O(5k) computational workflows representing pMSSM model points. Following ATLAS Analysis Preservation policies, many analyses have been preserved as containerised Yadage workflows and, after validation, were added to a curated selection for the pMSSM study. To run the workflows at scale, we used the REANA reusable analysis platform. We describe how the internal service scheduling of the REANA platform was changed to maximise concurrent throughput. We discuss the scalability of the approach on Kubernetes clusters from 500 to 5000 cores. Finally, we demonstrate the possibility of using additional ad-hoc public cloud resources by running the same workflows on the Google Cloud Platform.

Paper Structure (4 sections, 9 figures)

This paper contains 4 sections, 9 figures.

Figures (9)

  • Figure 1: A screenshot of the ATLAS SUSY group analyses preserved on GitLab. Each repository is labeled with the internal ATLAS analysis identifier and contains both workflow files and additional data files needed for the computational processing.
  • Figure 2: A typical pMSSM workflow. The computational runtime is about 10 minutes without systematics (test payload) and about 10 hours with all systematics (real payload).
  • Figure 3: The sequence diagram showing how REANA schedules incoming workflows after submission. The submitted workflows are announced via a message queue that is later processed by the workflow scheduler in Figure 4.
  • Figure 4: The sequence diagram showing how REANA schedules queued workflows. The system checks for available resources before accepting workflow runs for execution. The checking and rescheduling procedure offers several possibilities for optimisation. The workflows accepted for execution are further processed in Figure 5.
  • Figure 5: The sequence diagram showing how REANA executes scheduled workflows. Note the interplay between the scheduler and the Kubernetes cluster. The pod creation offers further scope for optimisation. The workflow execution status is monitored by a watching loop. A workflow job is started for each workflow step. The termination procedures are further illustrated in Figure 6.
  • ...and 4 more figures