AERO: An autonomous platform for continuous research

Valérie Hayot-Sasson, Abby Stevens, Nicholson Collier, Sudershan Sridhar, Kyle Conroy, J. Gregory Pauloski, Yadu Babuji, Maxime Gonthier, Nathaniel Hudson, Dante D. Sanchez-Gallegos, Ian Foster, Jonathan Ozik, Kyle Chard

TL;DR

The paper addresses gaps in data infrastructure for automated, continuous, cross-domain scientific research during health crises. It introduces AERO, an event-based automation platform that leverages a trigger-action paradigm, a distributed bring-your-own-resource compute/storage model, and Globus services (Compute, Flows, and Auth) to automate ingestion, validation, analysis, and sharing with provenance. Evaluation on a synthetic workload and two $R(t)$ estimation use cases demonstrates near-linear scalability and, on average, faster performance for Globus Flows compared to GitHub Actions, while highlighting variability from external services. The work provides open-source tooling and data references to enable reproducible, FAIR-compliant automated research across institutions, with practical impact for public health surveillance and beyond.

Abstract

The COVID-19 pandemic highlighted the need for new data infrastructure, as epidemiologists and public health workers raced to harness rapidly evolving data, analytics, and infrastructure in support of cross-sector investigations. To meet this need, we developed AERO, an automated research and data sharing platform for continuous, distributed, and multi-disciplinary collaboration. In this paper, we describe the AERO design and how it supports the automatic ingestion, validation, and transformation of monitored data into a form suitable for analysis; the automated execution of analyses on this data; and the sharing of data among different entities. We also describe how our AERO implementation leverages capabilities provided by the Globus platform and GitHub for automation, distributed execution, data sharing, and authentication. We present results obtained with an instance of AERO running two public health surveillance applications and demonstrate benchmarking results with a synthetic application, all of which are publicly available for testing.

Paper Structure

This paper contains 9 sections, 7 figures.

Figures (7)

  • Figure 1: AERO enables automated workflow execution by incorporating five core components. Users interact with the metadata server to register flows using rule-based triggers, which drive the automation. When rules are satisfied, they trigger user-defined actions that execute within predefined ingestion or analysis flows on user-provided compute and storage resources. Because AERO leverages user-provided data and infrastructure, a security model ensures that users cannot access resources they are not authorized to use.
  • Figure 2: AERO reference implementation event flow. Data is ingested from external sources via an automated timer-based ingestion flow that fetches, validates and stores the data on external storage at periodic intervals. The ingested data can then be queried through the centralized metadata server, and incorporated into automated analysis flows, which produce reports and visualizations that can then be shared with stakeholders, all through the platform.
  • Figure 3: AERO's bring-your-own-resources model. Centralized services in AERO include a database for capturing metadata, the web services, and the user-specified rule-based triggers.
  • Figure 4: AERO Search web interface.
  • Figure 5: Makespan of the synthetic ingestion application using GitHub Actions and Globus Flows over 5 repetitions.
  • ...and 2 more figures