FaaS and Furious: abstractions and differential caching for efficient data pre-processing

Jacopo Tagliabue, Ryan Curtin, Ciro Greco

TL;DR

A novel programming model for pipelines in a data lakehouse is introduced, allowing users to interact declaratively with assets in object storage, and a columnar and differential cache is exploited to maximize iteration speed for data scientists.

Abstract

Data pre-processing pipelines are the bread and butter of any successful AI project. We introduce a novel programming model for pipelines in a data lakehouse, allowing users to interact declaratively with assets in object storage. Motivated by real-world industry usage patterns, we exploit these new abstractions with a columnar and differential cache to maximize iteration speed for data scientists, who spend most of their time in pre-processing: adding or removing features, restricting or relaxing time windows, and wrangling current or older datasets. We show how the new cache works transparently across programming languages, schemas and time windows, and provide preliminary evidence of its efficiency on standard data workloads.

Paper Structure

This paper contains 9 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A sample multi-language, cloud data pipeline. The pipeline takes raw data in object storage (S3) to a final training dataset, by going through intermediate steps that wrangle dataframes into progressively cleaner data assets.
  • Figure 2: High-level communication flow between users and the cloud platform: 1) the user requests a DAG execution, 2) the control plane sends a physical plan to a cloud worker, 3) the worker fetches data from object storage, and 4) the worker returns the log messages and the tuples to the user.
  • Figure 3: The physical plan for cleaned_data. A system function performing scans over S3 is added automatically before the user code: this decoupling shields users from data management and allows the addition of a data cache (purple, Section \ref{sec:design}).
  • Figure 4: Differential, language-agnostic scans for workloads (1)-(3) (left to right): logical representation of the dataframes based on user code (top); S3 scans to download dataframe fragments (middle; request #3 requires no scan); physical dataframes as assembled from fragments (bottom).
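
The idea behind the differential scans in Figure 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the class `DifferentialCache`, the helpers `subtract` and `merge`, the integer time windows and the three sample requests are all hypothetical, chosen only to show how a cache that tracks coverage per column and per time window can reduce a new request to the fragments it is still missing.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open time window [start, end); hypothetical unit


def subtract(window: Interval, covered: List[Interval]) -> List[Interval]:
    """Return the parts of `window` not covered by sorted, disjoint `covered`."""
    missing: List[Interval] = []
    cursor, end = window
    for lo, hi in covered:
        if hi <= cursor:
            continue  # cached interval lies entirely before the cursor
        if lo >= end:
            break  # cached interval lies entirely after the window
        if lo > cursor:
            missing.append((cursor, lo))  # gap before this cached interval
        cursor = max(cursor, hi)
        if cursor >= end:
            break
    if cursor < end:
        missing.append((cursor, end))  # tail of the window is not cached
    return missing


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


class DifferentialCache:
    """Toy columnar cache: remembers which time windows are held per column."""

    def __init__(self) -> None:
        self.covered: Dict[str, List[Interval]] = {}

    def plan_scans(self, columns: List[str], window: Interval) -> List[Tuple[str, int, int]]:
        """Return the (column, start, end) fragments that still need a scan."""
        scans: List[Tuple[str, int, int]] = []
        for col in columns:
            held = self.covered.get(col, [])
            for lo, hi in subtract(window, held):
                scans.append((col, lo, hi))
            # After the scan, the whole window is available for this column.
            self.covered[col] = merge(held + [window])
        return scans


cache = DifferentialCache()
# Three hypothetical requests, loosely mirroring the three panels of Figure 4.
r1 = cache.plan_scans(["a", "b"], (0, 10))        # cold cache: scan everything
r2 = cache.plan_scans(["a", "b", "c"], (0, 20))   # new column and wider window
r3 = cache.plan_scans(["a", "c"], (5, 15))        # subset of what is cached
print(r1, r2, r3, sep="\n")
```

In this sketch the second request only scans the extra time range for the cached columns plus the full window for the new column, and the third request needs no scan at all. Because coverage is tracked by column name and window rather than by query text, the same bookkeeping would apply whichever language the user code is written in.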