Modeling Distributed Computing Infrastructures for HEP Applications

Maximilian Horzela, Henri Casanova, Manuel Giffels, Artur Gottmann, Robin Hofsaess, Günter Quast, Simone Rossi Tisbeni, Achim Streit, Frédéric Suter

TL;DR

The paper addresses the challenge of predicting performance for distributed HEP computing infrastructures like the WLCG, where large-scale testing is impractical. It introduces a C++ simulator built on SimGrid and WRENCH to model HEP workloads with data locality and caching on complex networks and storage. Calibration against CMS traces and two proof-of-concept studies demonstrate the tool’s ability to reproduce median job times and evaluate architectural options such as grid storage replacement and caching at high-latency sites. The results show that simulation is a practical, scalable approach for informing infrastructure design decisions without disrupting live operations, with potential to scale to the full WLCG in future work.

Abstract

Predicting the performance of different infrastructure design options is not trivial for complex federated infrastructures, such as the Worldwide LHC Computing Grid (WLCG), whose computing sites are distributed over a wide area network and support a plethora of users and workflows. Due to the complexity and size of these infrastructures, it is not feasible to deploy experimental test-beds at large scale merely to compare and evaluate alternative designs. An alternative is to study the behaviour of these systems using simulation. This approach has been used successfully in the past to identify efficient and practical infrastructure designs for High Energy Physics (HEP). A prominent example is the MONARC simulation framework, which was used to study the initial structure of the WLCG. New simulation capabilities are needed to simulate large-scale heterogeneous computing systems with complex networks, data access, and caching patterns. This paper outlines a modern tool, based on the SimGrid and WRENCH simulation frameworks, for simulating HEP workloads that execute on distributed computing infrastructures. Studies of its accuracy and scalability are presented using HEP as a case study. Hypothetical adjustments to prevailing computing architectures in HEP are studied, providing insights into the dynamics of a part of the WLCG and identifying candidates for improvement.

Paper Structure (12 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Job execution times per execution machine (worker node) vs. the hitrate in the real world (left) and in the simulation (right). Plots show the medians and the 2.5th and 97.5th percentiles of all jobs executed on each execution machine.
  • Figure 2: Simulation execution time and maximum memory consumption vs. the maximum number of active job slots, a proxy for the platform size (left); and simulated job execution times vs. hitrate with mitigated time complexity (right).
  • Figure 3: Job CPU efficiency vs. the hitrate for the scenarios of the first study (left) and the second study (right). Results are shown as standard box plots with the median, the lower and upper quartiles (25th and 75th percentiles), whiskers extending to the minimal and maximal single values within the interval from the lower quartile minus 1.5 times the interquartile range to the upper quartile plus 1.5 times the interquartile range, and outliers beyond that interval.
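
The box-plot convention described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `box_plot_summary` and the sample efficiency values are hypothetical and not taken from the paper, and the quartile interpolation method (`method="inclusive"`) is an assumption, since the caption does not say how the quartiles were computed.

```python
import statistics


def box_plot_summary(values):
    """Summarise values the way a standard box plot does.

    Needs at least two data points. The quartile interpolation
    method is an assumption, not something stated in the paper.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical CPU efficiencies for one hitrate bin (not data from the paper).
efficiencies = [0.52, 0.55, 0.58, 0.60, 0.61, 0.63, 0.66, 0.95]
summary = box_plot_summary(efficiencies)
print(summary)
```

In this made-up sample, the value 0.95 lies above the upper quartile plus 1.5 times the interquartile range, so it is reported as an outlier and the upper whisker ends at 0.66.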