
The integration of heterogeneous resources in the CMS Submission Infrastructure for the LHC Run 3 and beyond

Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem

TL;DR

The paper addresses integrating heterogeneous computing resources into the CMS Submission Infrastructure for LHC Run 3 and beyond, focusing on GPUs and non-x86 CPUs. It details a two-stage resource provisioning flow (GlideinWMS pilot factories followed by HTCondor matchmaking) with a CMS WM integration layer that supports GPU-specific workload attributes, enabling refined scheduling and opportunistic GPU usage. It presents practical progress, including GPU inventory and validation of Power9, ongoing ARM integration, and architecture-aware matchmaking to accommodate diverse hardware. The work highlights challenges such as standardizing heterogeneous slots, benchmarking and accounting, and enabling efficient CPU+GPU multi-step workloads, outlining a path toward fuller heterogeneous integration in the HL-LHC era.

Abstract

While the computing landscape supporting LHC experiments is currently dominated by x86 processors at WLCG sites, this configuration will evolve in the coming years. LHC collaborations will increasingly employ HPC and Cloud facilities to process the vast amounts of data expected during the LHC Run 3 and the future HL-LHC phase. These facilities often feature diverse compute resources, including alternative CPU architectures like ARM and IBM Power, as well as a variety of GPU specifications. Using these heterogeneous resources efficiently is thus essential for the LHC collaborations to reach their future scientific goals. The Submission Infrastructure (SI) is a central element in CMS Computing, enabling resource acquisition and exploitation by CMS data processing, simulation and analysis tasks. The SI must therefore be adapted to ensure access to, and optimal utilization of, this heterogeneous compute capacity. Some steps in this evolution have already been taken, as CMS is currently making opportunistic use of a small pool of GPU slots provided mainly at the CMS WLCG sites. Additionally, Power9 processors have been validated for CMS production at the Marconi-100 cluster at CINECA. This note will describe the updated capabilities of the SI to continue ensuring the efficient allocation and use of computing resources by CMS, despite their increasing diversity. The next steps towards a full integration and support of heterogeneous resources according to CMS needs will also be reported.

Paper Structure (6 sections, 3 figures)

Figures (3)

  • Figure 1: GlideinWMS and HTCondor components building a dynamically sized pool of compute resources for CMS SI.
  • Figure 2: GPU resources catalogue provided by the CMS SI team for CMS users, including device location and type and diverse technical properties.
  • Figure 3: Multi-architecture matchmaking in the CMS WM and SI systems.
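
The architecture-aware matchmaking named in the TL;DR and in Figure 3 can be pictured with a small toy model. The sketch below is only an illustration under stated assumptions: the `Slot` and `Job` classes, their fields, and the architecture labels are hypothetical names chosen for this example, and the logic is plain Python rather than the HTCondor ClassAd expressions or the CMS WM configuration actually used by the SI.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Slot:
    """Hypothetical description of a compute slot offered by a pilot."""
    name: str
    arch: str       # illustrative labels, e.g. "x86_64", "ppc64le", "aarch64"
    cpus: int
    gpus: int = 0


@dataclass(frozen=True)
class Job:
    """Hypothetical description of a workload request."""
    name: str
    allowed_archs: frozenset  # architectures the job is able to run on
    cpus: int = 1
    requires_gpu: bool = False


def matches(job: Job, slot: Slot) -> bool:
    """Return True if the slot satisfies the job's architecture, CPU and GPU needs."""
    if slot.arch not in job.allowed_archs:
        return False
    if slot.cpus < job.cpus:
        return False
    if job.requires_gpu and slot.gpus < 1:
        return False
    return True


def first_match(job: Job, slots: Iterable[Slot]) -> Optional[Slot]:
    """Return the first slot that can run the job, or None if there is none."""
    return next((slot for slot in slots if matches(job, slot)), None)


if __name__ == "__main__":
    pool = [
        Slot("cpu-x86", "x86_64", cpus=8),
        Slot("gpu-ppc", "ppc64le", cpus=16, gpus=4),
        Slot("cpu-arm", "aarch64", cpus=4),
    ]
    multi_arch_gpu_job = Job(
        "gpu-task", frozenset({"x86_64", "ppc64le"}), cpus=4, requires_gpu=True
    )
    chosen = first_match(multi_arch_gpu_job, pool)
    print(chosen.name if chosen else "no match")
```

In this toy pool, a GPU job that accepts more than one architecture can land on the non-x86 GPU slot, whereas the same job restricted to a single architecture would stay unmatched. That is the intuition behind letting workloads declare every architecture they support.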