Low-level I/O Monitoring for Scientific Workflows

Joel Witzke; Ansgar Lößer; Vasilis Bountris; Florian Schintke; Björn Scheuermann

TL;DR

This work tackles the problem of associating low-level I/O traces with high-level scientific workflow tasks across distributed compute environments. It evaluates three detailed I/O monitoring approaches (FUSE overlay file systems, ptrace-based tracing, and eBPF) and implements an end-to-end eBPF-based solution that captures kernel-level I/O events with minimal overhead. A key contribution is a set of strategies for bridging low-level traces to physical and logical workflow tasks, using Nextflow-specific behavior and Docker/Kubernetes metadata to map traces to tasks. Demonstrations on nf-core Nextflow workflows show per-task I/O profiles and the practical viability of the approach for bottleneck detection and resource optimization in large-scale scientific analyses.

Abstract

While detailed resource usage monitoring is possible at a low level with suitable tools, associating that usage with the higher-level application-layer abstractions that cause it presents a number of challenges. Suppose a large-scale scientific data analysis workflow runs in a distributed execution environment such as a compute cluster or cloud environment, and we want to analyze its I/O behaviour to find and alleviate potential bottlenecks. Different tasks of the workflow can be assigned to arbitrary compute nodes and may even share the same compute nodes. Thus, locally observed resource usage is not directly associated with the individual workflow tasks. By acquiring resource usage profiles of the involved nodes, we seek to correlate the trace data with the workflow and its individual tasks. To accomplish that, we select a set of metadata associated with low-level traces that lets us associate them with higher-level task information obtained from log files of the workflow execution as well as from job management by a task orchestrator such as Kubernetes with its container management. Ensuring a proper information chain allows the classification of observed I/O on a logical task level and may reveal the most costly or inefficient tasks of a scientific workflow, which are the most promising for optimization.

Paper Structure (19 sections, 5 figures)

This paper contains 19 sections, 5 figures.

Introduction
Background: Scientific Workflow Execution
Detailed Low-Level I/O Monitoring
I/O tracing with a FUSE overlay file system
I/O tracing with ptrace
I/O tracing with extended Berkeley Packet Filter (eBPF)
I/O Monitoring Using eBPF in Detail
The eBPF usermode process and eBPF function
Observed Limitations and Challenges
Associating Traces with Workflow Tasks
Information Gap
Bridging the Gap Using Special Nextflow Behavior
Bridging the Gap Using Docker and Kubernetes
Bringing Distributed Log Data Together
Demonstration
...and 4 more sections

Figures (5)

Figure 1: Components involved in different monitoring approaches.
Figure 2: Exemplary excerpts from different logs and traces (colorized to show links).
Figure 3: Architecture for mapping low-level monitoring data to upper-level identities such as tasks, using information such as a task's individual working directory wd (blue), the taskid given in tags (red), the cgroupid of processes (green), its pid (violet), the Kubernetes pod (teal), and the Kubernetes annotations, labeled kub annotations (orange). Following chains of this information from component to component, the low-level trace data can be associated with the upper-level tasks of the workflow engine, also across machine boundaries.
Figure 4: Duration between the first and last access of a file by a task, relative to the task's runtime. Analyzed from an execution of the nf-core rnaseq workflow in the test profile. A larger fraction for a file likely indicates streaming-like access, while a small fraction indicates bulkier file access by the task.
Figure 5: A file is accessed by two tasks. TRIMGALORE creates and writes it between 32 and 42% of its total runtime of 9.91s. The same task also reads the file back once later. A succeeding task, BBMAP_BBSPLIT, reads it over almost its whole lifespan of 24.16s.
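
The identifier chain described in the Figure 3 caption and the relative access span plotted in Figure 4 can be illustrated with a small sketch. This is not the authors' implementation: the `TraceEvent` schema, the two lookup tables, and all sample values (node name, cgroup id, pod name, task name, paths, timestamps) are hypothetical and only show how a cgroup id observed on a node might be followed to a pod and then to a task, after which the span between first and last access of a file is divided by the task's runtime.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    """One low-level I/O event (hypothetical schema, not the paper's format)."""
    node: str         # machine on which the event was observed
    cgroup_id: int    # cgroup of the process that issued the I/O
    pid: int          # process id on that node
    path: str         # file that was accessed
    timestamp: float  # seconds on a clock shared with the task log


def resolve_task(event, cgroup_to_pod, pod_to_task) -> Optional[str]:
    """Follow the chain (node, cgroup id) -> pod -> task; None if a link is missing."""
    pod = cgroup_to_pod.get((event.node, event.cgroup_id))
    return pod_to_task.get(pod) if pod is not None else None


def group_by_task_and_file(events, cgroup_to_pod, pod_to_task):
    """Collect the events of each (task, file) pair, dropping unattributable events."""
    grouped = defaultdict(list)
    for event in events:
        task = resolve_task(event, cgroup_to_pod, pod_to_task)
        if task is not None:
            grouped[(task, event.path)].append(event)
    return dict(grouped)


def access_span_fraction(events, task_start, task_end) -> float:
    """Time between first and last access of a file, relative to the task runtime."""
    runtime = task_end - task_start
    if runtime <= 0 or not events:
        raise ValueError("need a positive runtime and at least one event")
    times = [e.timestamp for e in events]
    return (max(times) - min(times)) / runtime


# Hypothetical lookup tables, e.g. built from container runtime and orchestrator logs.
CGROUP_TO_POD = {("node-a", 4211): "nf-pod-01"}
POD_TO_TASK = {"nf-pod-01": "TASK_A"}

# Hypothetical trace: two accesses by TASK_A and one event from an unrelated cgroup.
SAMPLE_EVENTS = [
    TraceEvent("node-a", 4211, 100, "/work/ab/reads.fq.gz", 12.0),
    TraceEvent("node-a", 4211, 100, "/work/ab/reads.fq.gz", 17.0),
    TraceEvent("node-a", 9999, 200, "/tmp/other", 13.0),
]

if __name__ == "__main__":
    grouped = group_by_task_and_file(SAMPLE_EVENTS, CGROUP_TO_POD, POD_TO_TASK)
    for (task, path), evs in grouped.items():
        # The task window 10.0 to 20.0 is an assumed start and end time.
        print(task, path, access_span_fraction(evs, 10.0, 20.0))
```

Keying the first lookup on the pair of node and cgroup id reflects that cgroup ids are only meaningful per machine, so traces from different nodes can be merged without collisions. Events whose chain cannot be completed are dropped rather than guessed at.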
