HyProv: Hybrid Provenance Management for Scientific Workflows

Vasilis Bountris, Lauritz Thamsen, Ulf Leser

TL;DR

HyProv addresses the need for online, scalable, workflow-aware provenance for scientific workflows executed on distributed clusters. It combines a central eDAG-based store for workflow-specification provenance with federated querying over existing infrastructure monitoring and logging systems, mediated by an event-driven layer. The prototype uses Redis for the central graph and Prometheus and Elasticsearch for the federated data, and it demonstrates sub-second query latencies, low overhead, and good scalability with increasing workflow size. This hybrid architecture supports complex, cross-layer provenance queries both during execution and after completion, which facilitates debugging, optimization, and failure analysis with minimal disruption to the underlying cluster. The work shows promise for broader adoption and for extension to additional workflow engines and infrastructure backends, with potential gains in query efficiency and extensibility for provenance ecosystems.

Abstract

Provenance plays a crucial role in scientific workflow execution, for instance by providing data for failure analysis, real-time monitoring, or statistics on resource utilization for right-sizing allocations. The workflows themselves, however, are becoming increasingly complex in terms of the components involved. Furthermore, they are executed on distributed cluster infrastructures, which makes the real-time collection, integration, and analysis of provenance data challenging. Existing provenance systems struggle to balance scalability, real-time processing, online provenance analytics, and integration across different components and compute resources. Moreover, most provenance solutions are not workflow-aware: because they target arbitrary workloads, they miss opportunities specific to workflow systems, where optimization and analysis can exploit a workflow specification that dictates, to some degree, the order of task execution and provides logical-level abstractions for physical tasks. In this paper, we present HyProv, a hybrid provenance management system that combines centralized and federated paradigms to offer scalable, online, and workflow-aware queries over workflow provenance traces. HyProv uses a centralized component to manage the small and stable provenance derived from the workflow specification efficiently, and complements it with federated querying over different scalable monitoring and provenance databases that hold the large-scale execution logs. This enables low-latency access to current execution data. Furthermore, the design supports complex provenance queries, which we exemplify for the workflow system Airflow in combination with the resource manager Kubernetes. Our experiments indicate that HyProv scales to large workflows, answers provenance queries with sub-second latencies, and adds only modest CPU and memory overhead to the cluster.
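
To make the hybrid idea concrete, the following is a minimal sketch of how a workflow-aware query could be split between the two sides: a central store resolves an abstract task to its physical task instances, and federated backends supply the execution metrics for those instances. It is not HyProv's implementation. The class and function names (`CentralStore`, `MetricsBackend`, `cpu_usage_for_abstract_task`), the data layout, and the sample values are all hypothetical, and plain in-memory objects stand in for the actual databases.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CentralStore:
    """Hypothetical stand-in for the central store of workflow-specification provenance."""

    # Assumed layout: abstract task name -> ids of its physical task instances.
    instances: dict[str, list[str]] = field(default_factory=dict)

    def register(self, abstract_task: str, instance_id: str) -> None:
        self.instances.setdefault(abstract_task, []).append(instance_id)

    def instances_of(self, abstract_task: str) -> list[str]:
        return list(self.instances.get(abstract_task, []))


class MetricsBackend:
    """Hypothetical stand-in for one federated monitoring database."""

    def __init__(self, samples: dict[str, list[float]]):
        self._samples = samples

    def cpu_samples(self, instance_id: str) -> list[float]:
        return self._samples.get(instance_id, [])


def cpu_usage_for_abstract_task(store, backends, abstract_task):
    """Resolve instances centrally, then fetch their metrics from every backend."""
    usage = {}
    for instance_id in store.instances_of(abstract_task):
        samples = [s for b in backends for s in b.cpu_samples(instance_id)]
        if samples:  # skip instances that no backend has data for
            usage[instance_id] = mean(samples)
    return usage


# Illustrative data only; task names and values are made up.
store = CentralStore()
store.register("align", "align-0")
store.register("align", "align-1")
store.register("sort", "sort-0")

backend = MetricsBackend({"align-0": [0.5, 1.5], "align-1": [2.0], "sort-0": [9.0]})
result = cpu_usage_for_abstract_task(store, [backend], "align")
print(result)
```

The split in this sketch mirrors the motivation given in the abstract: the mapping from logical to physical tasks is small and stable, so it can be kept centrally, while the bulky per-instance measurements stay in the systems that already collect them.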

Paper Structure

This paper contains 35 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: HyProv system architecture, with implementation software noted in parentheses. The core HyProv components are the Event Mediation Layer & Processing Module (blue), the Central Provenance Store (green), and the Query Interface (purple). They integrate with existing infrastructure components, shown in black and red, such as the workflow engine (Airflow), resource manager (Kubernetes), and monitoring databases (Prometheus, Elasticsearch).
  • Figure 2: CPU usage for all instances of a specific abstract task.