Total cost of ownership and evaluation of Google cloud resources for the ATLAS experiment at the LHC

The ATLAS Collaboration

TL;DR

The paper analyzes the ATLAS Google Project as a long-term case study for using commercial cloud resources to augment ATLAS computing capacity, presenting a detailed Total Cost of Ownership framework. It demonstrates successful cloud integration via Kubernetes-based PanDA/Harvester workflows and Rucio storage elements, while revealing that network egress costs are the dominant variable driver of total cost, highly dependent on workload. A Google Cloud subscription model substantially reduced cost relative to list price, illustrating the importance of negotiated volume discounts for large-scale cloud use. The work identifies actionable paths to reduce network costs (e.g., dedicated networks and LHCONE routing), discusses procurement frameworks (OCRE), and outlines future directions to optimize data movement and exploit non-standard cloud resources, underpinning the cloud-enabled evolution of the ATLAS computing model.

Abstract

The ATLAS Google Project was established as part of an ongoing evaluation of the use of commercial clouds by the ATLAS Collaboration, in anticipation of the potential future adoption of such resources by WLCG grid sites to fulfil or complement their computing pledges. Seamless integration of Google cloud resources into the worldwide ATLAS distributed computing infrastructure was achieved at large scale and for an extended period of time, and hence cloud resources are shown to be an effective mechanism to provide additional, flexible computing capacity to ATLAS. For the first time a total cost of ownership analysis has been performed, to identify the dominant cost drivers and explore effective mechanisms for cost control. Network usage significantly impacts the costs of certain ATLAS workflows, underscoring the importance of implementing such mechanisms. Resource bursting has been successfully demonstrated, whilst exposing the true cost of this type of activity. A follow-up to the project is underway to investigate methods for improving the integration of cloud resources in data-intensive distributed computing environments and reducing costs related to network connectivity, which represents the primary expense when extensively utilising cloud resources.

Paper Structure (18 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Monitoring plots for the first six months of running at the ATLAS Google site, from July to December 2022. (a) The number of running jobs at the Google site. (b) The accumulated data at the Google RSE split into different formats, the main ones being AOD (green), RDO (blue), HITS (purple) and DAOD (yellow). (c) The daily egress traffic out of the ATLAS Google site, split into the various destination sites. (d) The monthly list-price cost per service from the Google billing console, where the six main components are shown in the legend.
  • Figure 2: Data stored (blue) and egressed for job inputs (red) at the ATLAS Google site per month from July 2022 to September 2023. The ratio is also indicated by the black line.
  • Figure 3: (a) The variation of workflows running at the ATLAS Google site from January to April 2023 featuring several periods of running with a single workflow. The contribution from User Analysis jobs can be seen from March. (b) The daily list-price cost per service from the Google billing console for the period from January to April. The dominant services are compute CPU/RAM (blue/red), local storage on the worker nodes (orange), cloud storage (purple) and network egress (green and turquoise).
  • Figure 4: Distributions covering the data reprocessing campaign performed on the Google ATLAS site. (a) Number of running jobs, where the single job type period from July 11th to July 18th shows the data jobs in yellow, together with a small number of associated merge jobs in blue. (b) The data transfers out of the ATLAS Google site for the same period to different grid sites, where the main contribution in dark purple is the replication of the reprocessing output data to CERN. This is also visible for the data reprocessing jobs after July 18th. (c) The different types of transfers out of the ATLAS Google site for the same period.
  • Figure 5: Distributions covering the resource burst tests done at the ATLAS Google site in June 2023. (a) The running jobs of the two bursts of MC Full Simulation. (b) The wall-clock consumption of the jobs running on the Google site. (c) The daily list-price cost per service from the Google billing console, where the compute contributions are seen to dominate on the burst days.
  • ...and 2 more figures