Accelerating the Operation of Complex Workflows through Standard Data Interfaces
Taylor Paul, William Regli
TL;DR
The paper tackles cross-WAN scientific workflows by replacing the DAG's point-to-point edges with many-to-many data channels ($PS$ model) mediated by a Session Layer, enabling independent scaling of bottleneck steps. A Preliminary Reference Model and Architecture are proposed, where data producers send to the Session Layer and steps pull inputs and push outputs; for example, updating step $A$ to $A_2$ entails online deployment where $A_2$ reads $M_1$, $C_1$, and $C_2$ and writes to a new object-store prefix, allowing downstream steps to migrate. The approach distinguishes stream, object-store, and batch data-sharing interfaces with QoS considerations and describes data-sharing interfaces as routers enabling movement of steps across networks. Future work targets empirical evaluation of these interfaces, data replication with metrics across WANs, profiling of compute requests, and design of a scheduler to distribute workflow steps and data at optimal interfaces.
Abstract
In this position paper, we argue for standardizing how data is shared and processed in scientific workflows at the network level, to maximize step reuse and workflow portability across platforms and networks in pursuit of a foundational workflow stack. We look to evolve workflows from steps connected point-to-point in a directed acyclic graph (DAG) to steps connected via shared channels in a message system implemented as a network service. To start this evolution, we contribute a preliminary reference model, an architecture, and open tools to implement the architecture today. Our goal is to improve the deployment and operation of complex workflows by decoupling data sharing from data processing in workflow steps. We seek the workflow community's input on the merit of this approach, on related research to explore, and on initial requirements to inform future research.
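To illustrate the decoupling described above, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical in-memory `SessionLayer` class standing in for the network message service, and hypothetical channel names (`raw`, `out/v1`, `out/v2`). Steps never reference each other; they only pull from and push to named channels, so a revised step can be deployed alongside the original and write to a new channel without disturbing existing consumers.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class SessionLayer:
    """Hypothetical in-memory stand-in for a network message service."""

    def __init__(self) -> None:
        # Each named channel is an append-only list of items.
        self._log: Dict[str, List[object]] = defaultdict(list)

    def push(self, channel: str, item: object) -> None:
        """Append an item to a channel; any number of producers may do so."""
        self._log[channel].append(item)

    def pull(self, channel: str, offset: int = 0) -> List[object]:
        """Return a copy of the channel's items from offset onward.

        Pulling does not consume items, so many steps can read one channel.
        """
        return list(self._log[channel][offset:])


def run_step(
    layer: SessionLayer,
    fn: Callable[[object], object],
    inputs: List[str],
    output: str,
) -> None:
    """Pull every item from the input channels, apply fn, push the results."""
    for channel in inputs:
        for item in layer.pull(channel):
            layer.push(output, fn(item))


layer = SessionLayer()
for reading in (1, 2, 3):
    layer.push("raw", reading)  # a data producer sends to the session layer

# The original step and its revised version read the same input channel
# but write to separate output channels (illustrative transformations).
run_step(layer, lambda x: x * 2, ["raw"], "out/v1")
run_step(layer, lambda x: x * 2 + 1, ["raw"], "out/v2")

results_v1 = layer.pull("out/v1")
results_v2 = layer.pull("out/v2")
```

In this sketch, downstream steps could switch from `out/v1` to `out/v2` one at a time, which mirrors the online-update scenario in the summary. A real deployment would replace the in-memory log with a stream, object-store, or batch interface.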
