Steering a Fleet: Adaptation for Large-Scale, Workflow-Based Experiments

Jim Pruyne, Valerie Hayot-Sasson, Weijian Zheng, Ryan Chard, Justin M. Wozniak, Tekin Bicer, Kyle Chard, Ian T. Foster

TL;DR

The paper tackles the challenge of coordinating large-scale, data-intensive experiments where many flows operate concurrently across heterogeneous resources. It introduces Braid, a cloud-based decision engine that aggregates run-time measurements into datastreams, computes metrics, and applies policies to steer both individual flows and entire fleets. The key contributions include the formalization of Datastreams, Metrics, and Policy, a REST-based implementation with authorization, and flow-oriented interfaces (policy_eval, policy_wait, add_sample) that enable dynamic, policy-driven adaptation. Empirical evaluation demonstrates scalable ingestion and metric evaluation performance, and the authors illustrate practical applications through an HEDM workflow that achieves substantial efficiency gains by avoiding unnecessary scans. Overall, Braid provides a scalable, adaptable framework to improve the rate, reliability, and reproducibility of complex, data-driven experiments in cloud environments.

Abstract

Experimental science is increasingly driven by instruments that produce vast volumes of data and thus create a need to manage, compute on, describe, and index this data. High performance and distributed computing provide the means of addressing the computing needs; however, in practice, the variety of actions required and the distributed set of resources involved require sophisticated "flows" that define the steps to be performed on the data. As each scan or measurement is performed by an instrument, a new instance of the flow is initiated, resulting in a "fleet" of concurrently running flows whose overall goal is to process all the data collected during a potentially long-running experiment. During the course of the experiment, each flow may need to adapt its execution in response to changes in the environment, such as computational or storage resource availability, or to the progress of the fleet as a whole, such as the completion or discovery of an intermediate result that changes the behavior of subsequent flows. We introduce a cloud-based decision engine, Braid, which flows consult during execution to query their run-time environment and coordinate with other flows within their fleet. Braid accepts streams of measurements taken from the run-time environment or from within flow runs, which can then be statistically aggregated and compared to other streams to determine a strategy to guide flow execution. For example, queue lengths in execution environments can be used to direct a flow to run computations in one environment or another, or experiment progress as measured by individual flows can be aggregated to determine the progress and subsequent direction of the flows within a fleet. We describe Braid, its interface, implementation, and performance characteristics. We further show, through examples and experience modifying an existing scientific flow, how Braid is used to make adaptable flows.
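
The queue-length example in the abstract can be illustrated with a small in-memory sketch. This is not Braid's REST interface or client library: the `Datastream` class, the `mean_of_last` metric, and the signatures of `add_sample` and `policy_eval` are all assumptions made for illustration, borrowing only the operation names mentioned in the summary. The sketch records queue-depth samples for two hypothetical execution environments, aggregates each stream with a mean over its most recent samples, and applies a policy that selects the environment with the smaller value.

```python
from collections import deque
from statistics import mean


class Datastream:
    """Toy in-memory stand-in for a datastream (hypothetical, not Braid's API)."""

    def __init__(self, name, max_samples=1000):
        self.name = name
        # Bounded buffer: the oldest samples are dropped once it is full.
        self.samples = deque(maxlen=max_samples)

    def add_sample(self, value):
        """Append one measurement to the stream (assumed signature)."""
        self.samples.append(float(value))


def mean_of_last(stream, n):
    """Metric: mean of the most recent n samples (an assumed aggregation)."""
    recent = list(stream.samples)[-n:]
    if not recent:
        raise ValueError(f"datastream {stream.name!r} has no samples")
    return mean(recent)


def policy_eval(streams, n=3):
    """Policy: return the name of the stream with the smallest metric value."""
    return min(streams, key=lambda s: mean_of_last(s, n)).name


# Hypothetical queue depths reported by two execution environments.
cluster_a = Datastream("cluster_a_queue")
cluster_b = Datastream("cluster_b_queue")
for depth in (12, 15, 14):
    cluster_a.add_sample(depth)
for depth in (9, 4, 5):
    cluster_b.add_sample(depth)

# A flow would consult the policy before choosing where to run its next step.
chosen = policy_eval([cluster_a, cluster_b])
print(chosen)
```

In the system described by the paper, the samples and the decision live in a shared cloud service so that every flow in a fleet sees the same state; the sketch keeps everything in one process only to show how streams, metrics, and policies relate.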

Paper Structure (28 sections, 4 figures)

Figures (4)

  • Figure 1: Request rates driven by a single client
  • Figure 2: Request and failure rates driven by multiple, concurrent clients
  • Figure 3: Metric performance across various datastream sizes
  • Figure 4: Progress of the HEDM experiment