
Flow-Bench: A Dataset for Computational Workflow Anomaly Detection

George Papadimitriou, Hongwei Jin, Cong Wang, Rajiv Mayani, Krishnan Raghavan, Anirban Mandal, Prasanna Balaprakash, Ewa Deelman

TL;DR

FlowBench provides a publicly available multi-modal dataset and benchmark suite for anomaly detection in computational workflows, addressing the scarcity of open DAG-aware benchmarks for distributed HPC environments. It systematically injects synthetic anomalies across twelve diverse workflows, collecting both raw execution logs and parsed representations to support tabular, graph, and text-based analyses. The paper benchmarks supervised and unsupervised approaches, including PyOD, PyGOD, graph neural networks, and LLM-based supervised fine-tuning, highlighting scalability and performance trade-offs on large DAGs. This resource enables cross-domain evaluation of anomaly detection methods, fosters DAG-structure-aware modeling, and supports reproducibility and future research in reliable scientific workflows.

Abstract

A computational workflow, also known simply as a workflow, consists of tasks that must be executed in a specific order to attain a specific goal. Often, in fields such as biology, chemistry, physics, and data science, among others, these workflows are complex and are executed in large-scale, distributed, and heterogeneous computing environments prone to failures and performance degradation. Therefore, anomaly detection for workflows is an important paradigm that aims to identify unexpected behavior or errors in workflow execution. This crucial task, which improves the reliability of workflow executions, can be further assisted by machine learning-based techniques. However, the application of such techniques is limited, in large part, by the lack of open datasets and benchmarks. To address this gap, we make the following contributions in this paper: (1) we systematically inject anomalies and collect raw execution logs from workflows executing on distributed infrastructures; (2) we summarize the statistics of the new datasets and provide insightful analyses; (3) we convert workflows into tabular, graph, and text data, and benchmark them with the corresponding supervised and unsupervised anomaly detection techniques. The presented dataset and benchmarks allow examining the effectiveness and efficiency of scientific computational workflows and identifying potential research opportunities for improvement and generalization. The dataset and benchmark code are publicly available at https://poseidon-workflows.github.io/FlowBench/ under the MIT License.
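To make the idea of a workflow as ordered tasks with tabular and graph views concrete, here is a minimal sketch using only the Python standard library. It is not the FlowBench pipeline or data format: the task names, runtimes, and the median-based threshold rule are all hypothetical, chosen only to illustrate how a DAG can be ordered, flattened into rows, and screened for an unusually slow task.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from statistics import median

# Hypothetical toy workflow: each task maps to the set of tasks it depends on.
dag = {
    "split": set(),
    "train_a": {"split"},
    "train_b": {"split"},
    "train_c": {"split"},
    "merge": {"train_a", "train_b", "train_c"},
}

# Hypothetical per-task runtimes in seconds; train_b is made artificially slow.
runtime = {
    "split": 12.0,
    "train_a": 40.0,
    "train_b": 190.0,
    "train_c": 42.0,
    "merge": 15.0,
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(dag).static_order())

# Graph view: a sorted list of (parent, child) edges.
edges = sorted((p, c) for c, parents in dag.items() for p in parents)

# Tabular view: one row of simple features per task.
rows = [
    {"task": t, "runtime": runtime[t], "n_parents": len(dag[t])}
    for t in order
]


def flag_slow(rows, factor=3.0):
    """Flag tasks whose runtime exceeds `factor` times the median runtime.

    This is an illustrative rule of thumb, not a method from the paper.
    """
    med = median(r["runtime"] for r in rows)
    return [r["task"] for r in rows if r["runtime"] > factor * med]


flagged = flag_slow(rows)
```

In this toy setup the tabular rows could be passed to any tabular detector and the edge list to a graph-based one; the threshold rule merely stands in for such a model.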

Paper Structure (35 sections, 17 figures, 6 tables)


Figures (17)

  • Figure 1: Overview of FlowBench
  • Figure 2: Overview of the execution infrastructure. The deployment spans the Chameleon Cloud and the FABRIC testbed. Chameleon hosts the workers, while FABRIC hosts the networking infrastructure, the workflow submission node, and the data storage node. Docker containers are deployed on bare-metal nodes, and interference (CPU, HDD) is introduced using cgroups. An experimental controller on the workflow submission node orchestrates the anomaly injection, workflow execution triggering, and data labeling.
  • Figure 3: Overview of the 1000Genome sequencing analysis workflow. The workflow creates a branch for each chromosome, and each individual task processes a subset of the Phase 3 data (equally distributed).
  • Figure 4: Overview of the Montage workflow. In this case, the workflow uses images captured by the Digitized Sky Survey (DSS) and creates a branch for each band that is requested to be processed during the workflow generation. The size of the first level of each branch depends on the size of the section of the sky to be analyzed, while the size of the second level depends on the number of overlapping images stored in the archive.
  • Figure 5: Overview of the Predict Future Sales workflow. The workflow splits the data into 3 item categories and trains 3 XGBoost models that are later combined using an ensemble technique. It contains 3 hyperparameter tuning subworkflows that test different sets of features and pick the best-performing one. The number of HPO tasks is configurable and depends on the number of combinations that will be tested.
  • ...and 12 more figures