Advancing Anomaly Detection in Computational Workflows with Active Learning

Krishnan Raghavan, George Papadimitriou, Hongwei Jin, Anirban Mandal, Mariam Kiran, Prasanna Balaprakash, Ewa Deelman

TL;DR

This work tackles anomaly detection in large-scale computational workflows by addressing data scarcity through an active-learning loop embedded in Poseidon-X, an on-demand data-collection framework built atop Pegasus and two cloud testbeds (FABRIC and Chameleon). The authors formalize a continual-learning-driven active-learning strategy and implement a data-generation mechanism that queries the infrastructure to produce labeled anomaly traces, enabling both self-supervised and supervised GNN-based detectors. Through an end-to-end live experiment and emulations on Flow-Bench data, they show that uncertainty-guided data generation reduces training data requirements and improves detection accuracy, with notable gains in supervised settings and robust uncertainty reduction. The results demonstrate the practicality and scalability of on-demand experimentation for workflow anomaly detection, and the authors outline future directions including transfer learning to broaden applicability across diverse workflow types.

Abstract

A computational workflow, also known simply as a workflow, consists of tasks that are executed in a certain order to complete a specific computational campaign. Computational workflows are commonly employed in science domains such as physics, chemistry, and genomics to complete large-scale experiments in distributed and heterogeneous computing environments. However, running computations at such a large scale makes workflow applications prone to failures and performance degradation, which can slow down or stall execution and ultimately lead to workflow failure. Learning how these workflows behave under normal and anomalous conditions can help us identify the causes of degraded performance and subsequently trigger appropriate actions to resolve them. However, learning in such circumstances is challenging because of the large volume of high-quality historical data needed to train accurate and reliable models. Generating such datasets takes considerable time and effort, and it requires substantial resources to be devoted to data generation for training purposes. Active learning is a promising approach to this problem: data is generated as required by the machine learning model, which can potentially reduce the training data needed to derive accurate models. In this work, we present an active learning approach that is supported by an experimental framework, Poseidon-X, that utilizes a modern workflow management system and two cloud testbeds. We evaluate our approach using three computational workflows. For one workflow we run an end-to-end live active learning experiment; for the other two we evaluate our active learning algorithms using pre-captured data traces provided by the Flow-Bench benchmark. Our findings indicate that active learning not only saves resources but also improves the accuracy of anomaly detection.
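
The idea of generating data "as required by the machine learning model" can be illustrated with a small, self-contained emulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: `generate_traces` is a hypothetical stand-in for an on-demand experiment, the three class names and their feature means are invented for illustration, the model is a toy nearest-centroid classifier, and predictive entropy is assumed as the uncertainty score.

```python
import math
import random

# Hypothetical anomaly classes and 1-D feature means (illustration only).
CLASSES = ["normal", "cpu", "hdd"]
TRUE_MEANS = {"normal": 0.0, "cpu": 1.0, "hdd": 3.0}


def generate_traces(label, n, rng):
    """Stand-in for an on-demand experiment: returns n labeled samples."""
    return [(rng.gauss(TRUE_MEANS[label], 1.0), label) for _ in range(n)]


def fit_centroids(data):
    """Toy model: the mean feature value observed for each class."""
    sums = {c: [0.0, 0] for c in CLASSES}
    for x, y in data:
        sums[y][0] += x
        sums[y][1] += 1
    return {c: s / k for c, (s, k) in sums.items() if k > 0}


def predict_proba(centroids, x):
    """Softmax over negative squared distances to each centroid."""
    logits = {c: -(x - m) ** 2 for c, m in centroids.items()}
    top = max(logits.values())
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def entropy(probs):
    """Predictive entropy, used here as the (assumed) uncertainty score."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def class_uncertainty(centroids, data):
    """Mean predictive entropy of the model on each class's samples."""
    per_class = {c: [] for c in CLASSES}
    for x, y in data:
        per_class[y].append(entropy(predict_proba(centroids, x)))
    return {c: sum(v) / len(v) for c, v in per_class.items()}


def allocate(scores, budget):
    """Split a sample budget across classes in proportion to uncertainty."""
    total = sum(scores.values())
    if total == 0:  # no uncertainty anywhere: fall back to an even split
        return {c: budget // len(scores) for c in scores}
    return {c: round(budget * s / total) for c, s in scores.items()}


def active_learning(rounds=5, budget=30, seed=0):
    """Train, score uncertainty, request new data, repeat."""
    rng = random.Random(seed)
    data = [s for c in CLASSES for s in generate_traces(c, 5, rng)]
    history = []
    for _ in range(rounds):
        centroids = fit_centroids(data)
        plan = allocate(class_uncertainty(centroids, data), budget)
        history.append(plan)
        for label, n in plan.items():
            data.extend(generate_traces(label, n, rng))
    return fit_centroids(data), history
```

Calling `active_learning()` returns the final centroids and the per-round data-generation plans. Because each class count is rounded independently, a round may request one sample more or fewer than the nominal budget.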

Paper Structure (26 sections, 5 equations, 10 figures, 1 algorithm)

Figures (10)

  • Figure 1: Overview of the end-to-end active learning framework. This framework has three modules, the neural network (graph neural network or self-supervised learning), the active learning module and the Poseidon-X infrastructure. The active learning module trains the network and passes sensitivity scores to the Poseidon-X infrastructure which then generates the requisite data.
  • Figure 2: Poseidon-X experimental infrastructure that operates as the back end of our active learning framework. Poseidon-X spans the Chameleon Cloud and the FABRIC testbed. Chameleon hosts the workers, while FABRIC hosts the networking infrastructure, the workflow submission node and the data storage node. To use the resources efficiently, Poseidon-X uses Docker containers on bare-metal nodes and injects CPU and HDD interference on selected containers using cgroups. To model network anomalies, flows are directed over faulty paths in FABRIC. An experimental controller on the workflow submission node orchestrates the anomaly injection, workflow execution triggering and data labelling.
  • Figure 3: A pictorial representation of the self-supervised learning (SSL) model [jin2023ssl]. The SSL model comprises a graph-driven encoder and a multi-layer perceptron-based decoder.
  • Figure 4: A pictorial representation of the supervised learning-driven graph neural network module [jin2022workflow]. This module comprises two graph convolutional layers followed by a multi-layer perceptron.
  • Figure 5: Training on the 1000Genome workflow in a full-scale end-to-end experiment, with and without active training.
  • ...and 5 more figures
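
The architecture named in the Figure 4 caption (two graph convolutional layers followed by a multi-layer perceptron) can be sketched as a forward pass in plain Python. This is an illustrative sketch only: the symmetric adjacency normalization, the ReLU activations, the two-layer MLP, the per-node output, and all layer sizes are assumptions, since the caption does not specify them, and the weights here are random rather than trained.

```python
import math
import random


def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def normalized_adjacency(edges, n):
    """D^{-1/2} (A + I) D^{-1/2} on an undirected view of the task graph."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def random_weights(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def forward(edges, features, weights):
    """Two graph-convolution layers, then a two-layer MLP applied per node."""
    a_hat = normalized_adjacency(edges, len(features))
    w1, w2, w3, w4 = weights
    h = relu(matmul(a_hat, matmul(features, w1)))  # graph conv layer 1
    h = relu(matmul(a_hat, matmul(h, w2)))         # graph conv layer 2
    h = relu(matmul(h, w3))                        # MLP hidden layer
    return matmul(h, w4)                           # per-node class logits


# Usage: a hypothetical 4-task diamond-shaped workflow, 3 features per task.
rng = random.Random(0)
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
features = [[rng.random() for _ in range(3)] for _ in range(4)]
weights = [
    random_weights(3, 8, rng),  # conv 1: 3 -> 8
    random_weights(8, 8, rng),  # conv 2: 8 -> 8
    random_weights(8, 4, rng),  # MLP hidden: 8 -> 4
    random_weights(4, 2, rng),  # MLP output: 4 -> 2 (normal vs. anomalous)
]
logits = forward(edges, features, weights)
```

Each graph-convolution step mixes a task's features with those of its neighbors in the workflow graph, so the MLP sees a representation that reflects both the task and its surrounding structure.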