Automated, Reliable, and Efficient Continental-Scale Replication of 7.3 Petabytes of Climate Simulation Data: A Case Study

Lukasz Lacinski, Lee Liming, Steven Turoscy, Cameron Harr, Kyle Chard, Eli Dart, Paul Durack, Sasha Ames, Forrest M. Hoffman, Ian T. Foster

TL;DR

The paper tackles the challenge of duplicating petabytes of CMIP/ESGF climate data across major U.S. sites to improve reliability and accessibility. It describes a largely automated replication workflow built on Globus Transfer and a custom controller that coordinates 2×2291 path transfers (LLNL→ALCF and LLNL→OLCF), with failover routing during maintenance. Results show ~7.3 PB moved to each destination over ~77 days at average rates near 1.5 GB/s per link and peaks above 7.5 GB/s, despite LLNL GPFS bottlenecks. The work yields actionable lessons for large-scale data distribution and underlines the value of an ESGF-wide data replication fabric for CMIP7 and beyond.

Abstract

We report on our experiences replicating 7.3 petabytes (PB) of Earth System Grid Federation (ESGF) climate simulation data from Lawrence Livermore National Laboratory (LLNL) in California to Argonne National Laboratory (ANL) in Illinois and Oak Ridge National Laboratory (ORNL) in Tennessee. This movement of some 29 million files, twice, undertaken to establish new ESGF nodes at ANL and ORNL, was performed largely automatically by a simple replication tool, a script that invoked Globus to transfer large bundles of files while tracking progress in a database. Under the covers, Globus organized transfers to make efficient use of the high-speed Energy Sciences Network (ESnet) and the data transfer nodes deployed at participating sites, and also addressed security, integrity checking, and recovery from a variety of transient failures. This success demonstrates the considerable benefits that can accrue from the adoption of performant data replication infrastructure.
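
The pattern the abstract describes, a script that requests transfers of file bundles and records progress in a database, can be illustrated with a minimal sketch. This is not the authors' tool: the table layout, status values, example paths, and the `submit` callable (a stand-in for a real Globus transfer request) are all assumptions made for illustration.

```python
import sqlite3
from typing import Callable, Iterable

# Hypothetical status values; the paper's actual schema is not shown here.
PENDING, SUCCEEDED, FAILED = "PENDING", "SUCCEEDED", "FAILED"


def init_db(conn: sqlite3.Connection, paths: Iterable[str],
            destinations: Iterable[str]) -> None:
    """Create one row per (path, destination) pair, all initially pending."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transfer ("
        " path TEXT, dest TEXT, status TEXT,"
        " PRIMARY KEY (path, dest))"
    )
    dests = list(destinations)
    conn.executemany(
        "INSERT OR IGNORE INTO transfer VALUES (?, ?, ?)",
        [(p, d, PENDING) for p in paths for d in dests],
    )
    conn.commit()


def replicate_once(conn: sqlite3.Connection,
                   submit: Callable[[str, str], bool]) -> int:
    """Attempt every pair that has not succeeded yet; return how many remain."""
    rows = conn.execute(
        "SELECT path, dest FROM transfer WHERE status != ?", (SUCCEEDED,)
    ).fetchall()
    for path, dest in rows:
        ok = submit(path, dest)  # assumed stand-in for a Globus transfer request
        conn.execute(
            "UPDATE transfer SET status = ? WHERE path = ? AND dest = ?",
            (SUCCEEDED if ok else FAILED, path, dest),
        )
        conn.commit()  # persist progress so a restarted script can resume
    return conn.execute(
        "SELECT COUNT(*) FROM transfer WHERE status != ?", (SUCCEEDED,)
    ).fetchone()[0]


# Demo with made-up paths and a fake submitter that fails one transfer once.
conn = sqlite3.connect(":memory:")
init_db(conn, ["/data/a", "/data/b"], ["ALCF", "OLCF"])
attempts = []


def fake_submit(path: str, dest: str) -> bool:
    attempts.append((path, dest))
    first_try = attempts.count((path, dest)) == 1
    return not (path == "/data/b" and dest == "OLCF" and first_try)


remaining_first = replicate_once(conn, fake_submit)
remaining_second = replicate_once(conn, fake_submit)
```

Because the status of each (path, destination) pair is committed after every attempt, a second pass retries only the pairs that have not yet succeeded, which is the property that lets a long-running replication resume after an interruption.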

Paper Structure (11 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Federated CMIP6 cumulative data footprint, as of 2024-04-08: Datasets (above) and bytes (below).
  • Figure 2: The DOE Energy Sciences Network, ESnet, as of 2022, with the three sites involved in this data replication task highlighted. Map from https://www.es.net/about/
  • Figure 3: Principal elements of a high-performance climate data replication framework. High-performance storage systems at two sites, A and B, are connected to Data Transfer Nodes optimized for high-speed data movement and themselves connected to a wide area network (WAN) via a clean, high-bandwidth network path. Globus orchestrates data transfers, negotiating authentication and authorization, configuring transfer parameters for high-speed data movement, checking integrity of transfers, and detecting and responding to failures.
  • Figure 4: The logic used by the data replication script.
  • Figure 5: Two views of the replication task. Above: Cumulative bytes received at ALCF and OLCF, shown as separate lines, with some significant phases labeled. Below: Instantaneous transfer rates for the four source-destination pairs, each depicted with a different color. See text for further discussion.
  • ...and 2 more figures