Bauplan: zero-copy, scale-up FaaS for data pipelines

Jacopo Tagliabue, Tyler Caraza-Harter, Ciro Greco

TL;DR

This paper introduces bauplan, a novel FaaS programming model and serverless runtime designed for data practitioners that achieves both better performance and a superior developer experience for data workloads by making the trade-off of reducing generality in favor of data-awareness.

Abstract

Chaining functions for longer workloads is a key use case for FaaS platforms in data applications. However, modern data pipelines differ significantly from typical serverless use cases (e.g., webhooks and microservices); this makes it difficult to retrofit existing pipeline frameworks due to structural constraints. In this paper, we describe these limitations in detail and introduce bauplan, a novel FaaS programming model and serverless runtime designed for data practitioners. bauplan enables users to declaratively define functional Directed Acyclic Graphs (DAGs) along with their runtime environments, which are then efficiently executed on cloud-based workers. We show that bauplan achieves both better performance and a superior developer experience for data workloads by making the trade-off of reducing generality in favor of data-awareness.

Paper Structure

This paper contains 12 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A DAG of dataframes produced by transformations: the source dataframe transactions is first filtered for European countries (generating euro_selection), then the aggregation of revenues by country is computed as usd_by_country.
  • Figure 2: End-to-end architecture: 1) A user requests to run a DAG; 2) the APIs parse the request and send an execution plan to a worker; 3) an existing (bin-packing) or on-demand worker runs the required operations over customer data inside the customer cloud; 4) print statements and data previews are streamed back to the user.
  • Figure 3: From logical plan to execution, with three levels of representation for every run: (top to bottom) 1) logical plan, obtained by parsing user code, 2) physical plan, obtained by adding system functions, 3) the actual worker-level execution, transparently managing data and package caches to further speed up execution.
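
The DAG in Figure 1 can be illustrated with a minimal plain-Python sketch. This is not bauplan's API: the sample rows, the `EUROPEAN` set, the `usd` column, the `run_dag` helper, and the convention of wiring nodes by parameter name are all assumptions made for illustration. Only the node names `transactions`, `euro_selection`, and `usd_by_country` come from the caption.

```python
import inspect
from graphlib import TopologicalSorter

# Hypothetical sample rows standing in for the `transactions` source table.
TRANSACTIONS = [
    {"country": "IT", "usd": 10.0},
    {"country": "FR", "usd": 5.0},
    {"country": "US", "usd": 7.0},
    {"country": "IT", "usd": 2.5},
]
# Assumed (incomplete) set of European country codes, for illustration only.
EUROPEAN = {"IT", "FR", "DE", "ES"}


def euro_selection(transactions):
    """Keep only the rows whose country is in the European set."""
    return [row for row in transactions if row["country"] in EUROPEAN]


def usd_by_country(euro_selection):
    """Sum revenue per country over the filtered rows."""
    totals = {}
    for row in euro_selection:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["usd"]
    return totals


def run_dag(functions, sources):
    """Run functions in dependency order.

    Assumption of this sketch: each parameter name is the name of a parent
    node, which is either a source table or another function.
    """
    by_name = {f.__name__: f for f in functions}
    # Map every node to the set of nodes it depends on.
    graph = {
        name: set(inspect.signature(f).parameters)
        for name, f in by_name.items()
    }
    results = dict(sources)
    for node in TopologicalSorter(graph).static_order():
        if node in results:
            continue  # source tables are already materialized
        f = by_name[node]
        args = [results[p] for p in inspect.signature(f).parameters]
        results[node] = f(*args)
    return results


results = run_dag(
    [euro_selection, usd_by_country],
    {"transactions": TRANSACTIONS},
)
print(results["usd_by_country"])  # prints {'IT': 12.5, 'FR': 5.0}
```

Because dependencies are read from the function signatures, the user never states the execution order; the helper derives it, which is the sense in which such a DAG is declared rather than scripted.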