sAirflow: Adopting Serverless in a Legacy Workflow Scheduler

Filip Mikina, Pawel Zuk, Krzysztof Rzadca

TL;DR

This work demonstrates how to migrate a legacy workflow system (Airflow) to a serverless architecture by using change data capture (CDC) to drive an event-based control plane and by deploying executors on Function-as-a-Service (FaaS) and Container-as-a-Service (CaaS). The sAirflow design preserves Airflow's interfaces while scaling out rapidly and costing less than MWAA, a managed Airflow service. Experiments on synthetic and Alibaba-trace DAGs show that sAirflow scales well on parallel workloads and reduces monetary cost, although CDC latency and container startup time add measurable overhead for certain workflows. The work highlights practical challenges in serverless migrations and identifies CDC and serverless database integration as key areas for future improvement.

Abstract

Serverless clouds promise efficient scaling and reduced toil and monetary costs. Yet migrating a complex legacy application to serverless might require major refactoring and is thus risky. As a case study, we use Airflow, an industry-standard workflow system. To reduce migration risk, we propose to limit code modifications by relying on change data capture (CDC) and message queues for internal communication. To achieve serverless efficiency, we rely on Function-as-a-Service (FaaS). Our system, sAirflow, is the first adaptation of the control plane and workers to the serverless cloud, and it maintains the same interface and most of the code. Experimentally, we show that sAirflow delivers the key serverless benefits: scaling and cost reduction. We compare sAirflow to MWAA, a managed (SaaS) Airflow. On Alibaba benchmarks on warm systems, sAirflow performs similarly while halving the monetary cost. On highly parallel workflows on cold systems, sAirflow scales out in seconds to 125 workers, reducing makespan by 2x-7x.
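
The idea of using CDC and message queues for internal communication can be illustrated with a small sketch. This is not sAirflow's code: the `ChangeEvent` shape, the in-memory `MessageQueue`, the topic names `run_task` and `schedule`, and the routing rules are all assumptions made for illustration. The sketch only shows the general pattern in which row-level database changes, rather than direct calls between components, trigger the next control-plane step.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change as a CDC stream might report it (hypothetical shape)."""
    table: str
    op: str    # "insert", "update", or "delete"
    row: dict


class MessageQueue:
    """In-memory stand-in for a managed message queue (assumption for the demo)."""

    def __init__(self):
        self._items = deque()

    def publish(self, topic, payload):
        self._items.append((topic, payload))

    def drain(self):
        # Yield messages in FIFO order until the queue is empty.
        while self._items:
            yield self._items.popleft()


def route_change(event, queue):
    """Translate a database change into a control-plane message.

    The table name, states, and topics are illustrative assumptions.
    """
    if event.table != "task_instance":
        return  # changes to other tables are ignored in this sketch
    state = event.row.get("state")
    if state == "queued":
        # A task became runnable: ask a worker function to execute it.
        queue.publish("run_task", event.row)
    elif state in ("success", "failed"):
        # A task finished: ask the scheduler to re-evaluate the DAG.
        queue.publish("schedule", {"dag_id": event.row["dag_id"]})


def run_handlers(queue, handlers):
    """Invoke the handler registered for each queued message, as FaaS triggers would."""
    return [handlers[topic](payload) for topic, payload in queue.drain()]


def demo():
    queue = MessageQueue()
    changes = [
        ChangeEvent("task_instance", "update",
                    {"dag_id": "d1", "task_id": "t1", "state": "queued"}),
        ChangeEvent("task_instance", "update",
                    {"dag_id": "d1", "task_id": "t1", "state": "success"}),
        ChangeEvent("log", "insert", {"msg": "not routed"}),
    ]
    for change in changes:
        route_change(change, queue)
    handlers = {
        "run_task": lambda p: f"worker runs {p['dag_id']}.{p['task_id']}",
        "schedule": lambda p: f"scheduler re-evaluates {p['dag_id']}",
    }
    return run_handlers(queue, handlers)


if __name__ == "__main__":
    for line in demo():
        print(line)
```

In this pattern the components never call each other directly: the writer only updates the database, and the routing step decides which function is invoked next. This is one way to keep changes to existing application code small, which is the motivation the abstract gives for relying on CDC.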

Paper Structure (23 sections, 1 equation, 17 figures, 6 tables)

Figures (17)

  • Figure 1: sAirflow on AWS (icons' source: AWS)
  • Figure 2: Sample DAGs obtained from jobs in the Alibaba trace. Note that the last DAG is highly parallel (77 tasks in total, 76 of which run in parallel on start-up). Some of the tasks do not have a downstream dependency.
  • Figure 3: Parallel DAGs, $p=10$, $T=30$, $n=125$ (cold starts). sAirflow shortens the makespan by 7.2x (left). Gantt charts (middle, right) show a single run.
  • Figure 4: Warm system, $p=10$, $T=5$. The first DAG run is not reported.
  • Figure 5: Alibaba DAGs: makespans on all DAGs (left) and a detailed analysis on the three DAGs from Fig. 2.
  • ...and 12 more figures