Employing Artificial Intelligence to Steer Exascale Workflows with Colmena

Logan Ward; J. Gregory Pauloski; Valerie Hayot-Sasson; Yadu Babuji; Alexander Brace; Ryan Chard; Kyle Chard; Rajeev Thakur; Ian Foster

TL;DR

This paper describes the design of Colmena, the challenges overcome while deploying applications on exascale systems, and the science workflows enhanced by interweaving AI.

Abstract

Computational workflows are a common class of application on supercomputers, yet their loosely coupled and heterogeneous nature often keeps them from taking full advantage of a supercomputer's capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce the communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations, coupled with a variety of application patterns accessible through our agent-based steering model, have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.

Paper Structure (29 sections, 7 figures)

Introduction
Related Work
AI Approaches for Science Workflows
Workflow Engines
Integrating AI and Workflows
Design
Programming Model
Agent Types
Threading
Defining Tasks
Task Queues
Task Execution
Data Fabric
Scaling on Supercomputers
Case Study: Molecular Design
...and 14 more sections

Figures (7)

Figure 1: A Colmena application is composed of a Thinker and Task Server connected by a task queue. Thinkers define the policy for submitting computations using a series of agents that interact with each other and the Task Servers. Task Servers delegate computations to workers running on compute nodes. Applications that manage large datasets or run at large scales can use ProxyStore to pass references to inputs and outputs via the workflow engine and object data via a side channel.
Figure 2: Allocation of HPC nodes between different tasks over time for a Colmena-based molecular design application. Nodes may either run quantum chemistry simulations (yellow), train a machine learning model (blue), or use the model to infer the properties of a molecule (red). The application first runs inference on all nodes and then runs simulation tasks until sufficient data is available to begin re-training machine learning and re-running inference on a subset of nodes. Light shades indicate periods where either no computation was running or the running calculation did not complete before the end of the allocation. Figure from ward2021colmena.
Figure 3: Weak scaling of inference rate as a function of node count for a molecular design application that uses message passing neural networks. Experiments were performed on the Theta Supercomputer at Argonne National Laboratory. Figure from ward2021colmena.
Figure 4: (a) Scientific output of our molecular design application over time and (b) key performance timings (time to complete machine learning tasks, average time between tasks for CPU workers) of a multi-site implementation of our molecular design application with different Colmena backends. Our implementations using the Parsl workflow engine and Parsl with Redis to transmit task data both required maintaining SSH tunnels between sites. The implementation with FuncX and Globus does not require direct network connections between sites, yet has similar scientific output and comparable performance timings.
Figure 5: Utilization of each of 480 nodes (1920 GPUs) of ALCF's Polaris supercomputer over time for a Colmena application that simultaneously trains a reinforcement learning model that generates proteins, generates new proteins with that model, and evaluates the quality of the proteins. Periods of higher utilization are represented as deeper shades and color indicates the type of task being run.
...and 2 more figures
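
The architecture described in the Figure 1 caption (a Thinker that submits work and reacts to completions, a Task Server that runs it, and a side channel that carries the data while only references travel through the queues) can be illustrated with a minimal sketch. This uses only the Python standard library and is not Colmena's or ProxyStore's API: ReferenceStore, task_server, simulate, run_thinker, and max_in_flight are hypothetical names chosen for illustration, and a single worker thread stands in for compute nodes.

```python
import queue
import threading
import uuid


class ReferenceStore:
    """Hypothetical in-memory side channel; stands in for a data fabric."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, obj):
        # Store the object and hand back a small key that is cheap to send.
        key = uuid.uuid4().hex
        with self._lock:
            self._data[key] = obj
        return key

    def get(self, key):
        with self._lock:
            return self._data[key]


def simulate(xs):
    """Placeholder for an expensive computation (assumption, not from the paper)."""
    return sum(x * x for x in xs)


def task_server(task_q, result_q, store):
    """Pull task requests, run them, and send back references to the outputs."""
    while True:
        request = task_q.get()
        if request is None:  # sentinel: shut down
            break
        func, input_key = request
        output = func(store.get(input_key))
        result_q.put(store.put(output))


def run_thinker(inputs, max_in_flight=2):
    """Steering policy: keep up to max_in_flight tasks running, refill on completion."""
    task_q, result_q, store = queue.Queue(), queue.Queue(), ReferenceStore()
    server = threading.Thread(target=task_server, args=(task_q, result_q, store))
    server.start()

    pending = list(inputs)
    results = []
    in_flight = 0

    def submit():
        nonlocal in_flight
        # Only the function and a reference to the input go through the queue.
        task_q.put((simulate, store.put(pending.pop(0))))
        in_flight += 1

    while pending and in_flight < max_in_flight:
        submit()

    # React to each completion event by recording the result and submitting more work.
    while in_flight:
        output_key = result_q.get()
        in_flight -= 1
        results.append(store.get(output_key))
        if pending:
            submit()

    task_q.put(None)
    server.join()
    return results


if __name__ == "__main__":
    print(run_thinker([[1, 2], [3], [4, 5, 6]]))
```

In this sketch the policy simply refills the queue whenever a task finishes. In the setting the abstract describes, that decision point is where an agent could instead retrain a model or reprioritize the remaining work.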
