Repurposing of the Run 2 CMS High Level Trigger Infrastructure as a Cloud Resource for Offline Computing

Marco Mascheroni, Antonio Perez-Calero Yzquierdo, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem, Daniele Spiga, Christoph Wissing, Frank Wurthwein

TL;DR

The paper describes the repurposing of the CMS Run 2 HLT infrastructure as an on-site cloud resource at LHC Point 5, supporting Tier 0 prompt reconstruction and other offline workloads. It introduces the Vacuum Model Glideins approach, which features a new site endpoint and a glidein-launcher service that enable pilots launched directly from the VMs, resource validation, and defragmentation management. The authors report commissioning Tier 0 tasks in Run 3, including defragmentation to provide 8-core slots, and demonstrate rapid allocation of cores to Tier 0 on T2_CH_CERN_P5, which supports the operational gains of the new model. The conclusions highlight improved resilience and automated pilot reconfiguration, with future plans to extend the model to the Run 3 HLT farm, pursue an Interfill mode, and enable GPU virtualization pending configuration.

Abstract

The former CMS Run 2 High Level Trigger (HLT) farm is one of the largest contributors to CMS compute resources, providing about 25k job slots for offline computing. This CPU farm was initially employed as an opportunistic resource, exploited during inter-fill periods in LHC Run 2. Since then, it has become a nearly transparent extension of the CMS capacity at CERN, being located on-site at LHC interaction point 5 (P5), where the CMS detector is installed. This resource has been configured to support the execution of critical CMS tasks, such as prompt detector data reconstruction. It can therefore be used in combination with the dedicated Tier 0 capacity at CERN to process and absorb peaks in the stream of data coming from the CMS detector. The initial configuration for this resource, based on statically configured VMs, provided the required level of functionality. However, regular operation of this cluster revealed certain limitations compared to the resource provisioning and usage model employed at WLCG sites. A new configuration, based on a vacuum-like model, has been implemented for this resource to address these shortcomings. This paper reports on the redeployment of this permanent cloud for enhanced support of CMS offline computing, comparing the functionality of the former and new models and describing the commissioning effort for the new setup.

Paper Structure (3 sections, 4 figures)

Figures (4)

  • Figure 1: Schematic view of the new HLT farm deployment model, based on vacuum-like pilots.
  • Figure 2: Transition period for the deployment model of the HLT resources, based on vacuum-like pilots.
  • Figure 3: Transition period for the deployment model of the HLT resources, based on vacuum-like pilots.
  • Figure 4: Transition period for the deployment model of the HLT resources, based on vacuum-like pilots.