Table of Contents
Fetching ...

Octopus: Experiences with a Hybrid Event-Driven Architecture for Distributed Scientific Computing

Haochen Pan, Ryan Chard, Sicheng Zhou, Alok Kamatar, Rafael Vescovi, Valérie Hayot-Sasson, André Bauer, Maxime Gonthier, Kyle Chard, Ian Foster

TL;DR

Octopus is introduced, a hybrid, cloud-to-edge event fabric designed to link many local event producers and consumers with cloud-hosted brokers, and to provide a fabric for developing resilient applications to support the development of scientific EDA.

Abstract

Scientific research increasingly relies on distributed computational resources, storage systems, networks, and instruments, ranging from HPC and cloud systems to edge devices. Event-driven architecture (EDA) benefits applications targeting distributed research infrastructures by enabling the organization, communication, processing, reliability, and security of events generated from many sources. To support the development of scientific EDA, we introduce Octopus, a hybrid, cloud-to-edge event fabric designed to link many local event producers and consumers with cloud-hosted brokers. Octopus can be scaled to meet demand, permits the deployment of highly available Triggers for automatic event processing, and enforces fine-grained access control. We identify requirements in self-driving laboratories, scientific data automation, online task scheduling, epidemic modeling, and dynamic workflow management use cases, and present results demonstrating Octopus' ability to meet those requirements. Octopus supports producing and consuming events at a rate of over 4.2 M and 9.6 M events per second, respectively, from distributed clients.

Octopus: Experiences with a Hybrid Event-Driven Architecture for Distributed Scientific Computing

TL;DR

Octopus is introduced, a hybrid, cloud-to-edge event fabric designed to link many local event producers and consumers with cloud-hosted brokers, and to provide a fabric for developing resilient applications to support the development of scientific EDA.

Abstract

Scientific research increasingly relies on distributed computational resources, storage systems, networks, and instruments, ranging from HPC and cloud systems to edge devices. Event-driven architecture (EDA) benefits applications targeting distributed research infrastructures by enabling the organization, communication, processing, reliability, and security of events generated from many sources. To support the development of scientific EDA, we introduce Octopus, a hybrid, cloud-to-edge event fabric designed to link many local event producers and consumers with cloud-hosted brokers. Octopus can be scaled to meet demand, permits the deployment of highly available Triggers for automatic event processing, and enforces fine-grained access control. We identify requirements in self-driving laboratories, scientific data automation, online task scheduling, epidemic modeling, and dynamic workflow management use cases, and present results demonstrating Octopus' ability to meet those requirements. Octopus supports producing and consuming events at a rate of over 4.2 M and 9.6 M events per second, respectively, from distributed clients.
Paper Structure (30 sections, 8 figures, 3 tables)

This paper contains 30 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: The Octopus event fabric spans locations and provides an available and resilient basis for developing applications. Various event topics allow filtered views by consumers.
  • Figure 2: Octopus architecture. Users interact with the web service (green arrows) to acquire credentials, where their identity is mapped to an IAM entity and shared with the MSK ZooKeeper. Producers (left) and consumers (right) communicate with the event fabric to publish and receive events, respectively (blue arrows). A trigger can be configured to act on events and invoke remote actions (yellow arrows). Events can also be persisted to reliable cloud storage when enabled (red arrows). Monitoring consoles are available for admins to monitor the system's live status and historical statistics.
  • Figure 3: Median (top) and 99th percentile (bottom) latencies vs. throughput for configurations 1--6 on baseline cluster with remote producers.
  • Figure 4: Trigger scaling using a topic with 128 partitions to receive individual messages and sleep for 30 s. Messages are slowly processed until the processing pressure results in 128 concurrent functions, then scaling down shortly before the workload is complete.
  • Figure 5: Producer and consumer throughput vs. number of topics. For producer, increases to 273 K $\times$ 1 KB event/s at four topics and then remains flat. For consumer, increases to 846 K $\times$ 1 KB event/s at 16 topics.
  • ...and 3 more figures