Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach

Hernan Picatto, Georg Heiler, Peter Klimek

TL;DR

The paper addresses the high cost and vendor lock-in associated with Spark-based PaaS like Databricks and EMR by proposing a Dagster-based, cloud-agnostic orchestration framework that integrates multiple Spark execution environments. By containerizing Dagster and designing core components (Context Injector, Message Reader, Cloud Client Innovations, Automation/Integration, Dynamic Cloud Client Factory), the approach aims for reproducible, scalable pipelines with reduced operational costs. Empirical results show a 12% performance gain over EMR and a 40% cost reduction versus Databricks, demonstrated on a Common Crawl–derived use case mapping interfirm networks. The work highlights architecture, implementation challenges, and platform trade-offs, offering a practical pathway to cost-efficient, vendor-neutral big data processing with improved prototyping and reproducibility.

Abstract

The rapid advancement of big data technologies has underscored the need for robust and efficient data processing solutions. Traditional Spark-based Platform-as-a-Service (PaaS) solutions, such as Databricks (DBR) and Amazon Web Services Elastic MapReduce (EMR), provide powerful analytics capabilities but often result in high operational costs and vendor lock-in. These platforms, while user-friendly, can lead to significant inefficiencies due to their cost structures and lack of transparent pricing. This paper introduces a cost-effective and flexible orchestration framework using Dagster. Our solution aims to reduce dependency on any single PaaS provider by integrating various Spark execution environments. We demonstrate how Dagster's orchestration capabilities can enhance data processing efficiency, enforce best coding practices, and significantly reduce operational costs. In our implementation, we achieved a 12% performance improvement over EMR and a 40% cost reduction compared to DBR, translating to over 300 euros saved per pipeline run. Our goal is to provide a flexible, developer-controlled computing environment that maintains or improves performance and scalability while mitigating the risks associated with vendor lock-in. The proposed framework supports rapid prototyping and testing, which is essential for continuous development and operational efficiency, contributing to a more sustainable model of large-scale data processing.
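The two cost figures in the abstract can be related with a short calculation. The sketch below assumes that the savings of "over 300 euros" correspond exactly to the 40% reduction, so the results are only lower bounds implied by the abstract, not values reported by the authors. The function name `implied_baseline_eur` is hypothetical.

```python
def implied_baseline_eur(savings_eur: float, reduction_pct: int) -> float:
    """Baseline cost implied by an absolute saving and a percentage reduction."""
    if not 0 < reduction_pct <= 100:
        raise ValueError("reduction_pct must be in (0, 100]")
    return savings_eur * 100 / reduction_pct


# Assumption: 300 euros is treated as the full 40% reduction (a lower bound).
savings = 300.0
baseline = implied_baseline_eur(savings, 40)  # implied Databricks cost per run
remaining = baseline - savings                # implied cost after the reduction

if __name__ == "__main__":
    print(f"implied baseline: at least {baseline:.0f} EUR per run")
    print(f"implied new cost: at least {remaining:.0f} EUR per run")
```

Under that assumption, a Databricks pipeline run would cost at least 750 euros and the multi-platform run at least 450 euros.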

Paper Structure (10 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Diagram of orchestrator behavior.
  • Figure 2: Detailed Dagster pipeline showing how the execution environment can be chosen as needed among local, EMR, and DBR.
  • Figure 3: Stacked Plot of Trial Runs by Platform.
  • Figure 4: Effort Needed for Implementing Each Platform Client.
  • Figure 5: Total Cost of Production Runs by Asset.
  • ...and 1 more figure