PRE-Share Data: Assistance Tool for Resource-aware Designing of Data-sharing Pipelines

Sepideh Masoudi

TL;DR

The paper addresses the complexity of designing data-sharing pipelines under governance and resource constraints. It introduces PRE-share Data, a cloud-based tool that automatically detects reuse opportunities across data transformation pipelines, generates reuse-based configurations, and provides resource-impact reports using parameterized templates on Kubeflow. The key contributions are an open-source architecture with components (Coordinator, Reuse Strategy Engine, Exporter), embedded transformation templates, and Prometheus-based reporting to compare designs and estimate savings. This work enables more reusable, resource-aware data-sharing pipelines within self-service data platforms, with planned extensions to additional platforms and richer cost-performance reporting.

Abstract

Data is a valuable asset, and sharing it as a product across organizations is key to building comprehensive and useful insights in fields such as science and industry. Before sharing, data often requires transformation to comply with governance policies and meet the requirements of recipient organizations. By leveraging pipelines, these transformations can be modeled as chains of processes; however, designing such pipelines while ensuring their efficiency is complex. In this paper, we present a tool that supports the design of pipelines by identifying opportunities for reusing transformation processes across different pipelines and suggesting designs and configurations based on these opportunities. This tool also generates reports on the resource consumption of pipeline processes, enabling the estimation of potential resource savings achievable through reuse-based designs. It could serve as a foundation for more efficient and resource-conscious data transformation pipeline design and be used as a component in self-service data platforms.

Paper Structure

This paper contains 5 sections and 3 figures.

Figures (3)

  • Figure 1: Architecture of the PRE-share Data tool: an assistance tool for discovering resource-aware reuse-based designs in pipeline configurations.
  • Figure 2: Sample input YAML configuration file for PRE, defining the pipelines and their parameter values for execution.
  • Figure 3: Sample reuse-based configuration generated by PRE, formatted for execution in PRE. PRE defines new pipelines for the common processes and creates chains of pipelines by reusing these new pipelines across different pipelines. Names of auto-generated pipelines start with 'auto-gen.*'.
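
The reuse idea described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the tool's actual Reuse Strategy Engine: the pipeline names, template names, parameters, and the `suggest_reuse` helper are all hypothetical, and the sketch assumes that two pipelines can share work only when their leading steps use the same template with the same parameter values. Only the 'auto-gen' name prefix is taken from the caption; the numeric suffix is an assumption.

```python
from itertools import combinations

# Hypothetical input: each pipeline is an ordered list of
# (template_name, parameters) steps. All names are illustrative only.
pipelines = {
    "share-with-org-a": [
        ("anonymize", {"columns": ("name", "email")}),
        ("filter", {"min_year": 2020}),
        ("to_csv", {}),
    ],
    "share-with-org-b": [
        ("anonymize", {"columns": ("name", "email")}),
        ("filter", {"min_year": 2020}),
        ("aggregate", {"by": "region"}),
    ],
}


def step_key(step):
    """Return a comparable key: template name plus sorted parameter items."""
    name, params = step
    return (name, tuple(sorted(params.items())))


def common_prefix(a, b):
    """Return the leading steps that two pipelines have in common."""
    prefix = []
    for step_a, step_b in zip(a, b):
        if step_key(step_a) != step_key(step_b):
            break
        prefix.append(step_a)
    return prefix


def suggest_reuse(pipelines):
    """Suggest one shared pipeline per pair of pipelines with a common prefix."""
    suggestions = []
    for (name_1, steps_1), (name_2, steps_2) in combinations(sorted(pipelines.items()), 2):
        shared = common_prefix(steps_1, steps_2)
        if shared:
            suggestions.append({
                # Assumed naming scheme: 'auto-gen' prefix plus a counter.
                "name": f"auto-gen-{len(suggestions)}",
                "steps": shared,
                "reused_by": [name_1, name_2],
            })
    return suggestions


for suggestion in suggest_reuse(pipelines):
    print(suggestion["name"], [s[0] for s in suggestion["steps"]], suggestion["reused_by"])
```

In this sketch the two hypothetical pipelines share the `anonymize` and `filter` steps, so a single shared pipeline is suggested and both originals would continue from its output. Matching only common prefixes is a deliberate simplification; a real engine may detect reuse in other positions of the chain.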