Modeling Distributed Computing Infrastructures for HEP Applications
Maximilian Horzela, Henri Casanova, Manuel Giffels, Artur Gottmann, Robin Hofsaess, Günter Quast, Simone Rossi Tisbeni, Achim Streit, Frédéric Suter
TL;DR
The paper addresses the challenge of predicting performance for distributed HEP computing infrastructures like the WLCG, where large-scale testing is impractical. It introduces a C++ simulator built on SimGrid and WRENCH that models HEP workloads with data locality and caching on complex networks and storage. Calibration against CMS traces and two proof-of-concept studies demonstrate the tool’s ability to reproduce median job times and evaluate architectural options such as grid storage replacement and caching at high-latency sites. The results show that simulation is a practical, scalable approach for informing infrastructure design decisions without disrupting live operations, with potential to scale to the full WLCG in future work.
Abstract
Predicting the performance of various infrastructure design options is not trivial in complex federated infrastructures, such as the Worldwide LHC Computing Grid (WLCG), whose computing sites are distributed over a wide area network and support a plethora of users and workflows. Due to the complexity and size of these infrastructures, it is not feasible to deploy experimental test-beds at large scales merely for the purpose of comparing and evaluating alternate designs. An alternative is to study the behaviours of these systems using simulation. This approach has been used successfully in the past to identify efficient and practical infrastructure designs for High Energy Physics (HEP). A prominent example is the MONARC simulation framework, which was used to study the initial structure of the WLCG. New simulation capabilities are needed to simulate large-scale heterogeneous computing systems with complex networks, data access and caching patterns. A modern tool, based on the SimGrid and WRENCH simulation frameworks, for simulating HEP workloads that execute on distributed computing infrastructures is outlined. Studies of its accuracy and scalability are presented using HEP as a case study. Hypothetical adjustments to prevailing computing architectures in HEP are studied, providing insights into the dynamics of a part of the WLCG and identifying candidates for improvements.
