Zerrow: True Zero-Copy Arrow Pipelines in Bauplan

Yifan Dai, Jacopo Tagliabue, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Tyler R. Caraza-Harter

TL;DR

This work tackles the gap between Arrow's theoretical zero-copy property and the data movement overheads seen in practice in Arrow-based DAGs on Bauplan. It introduces Zerrow, a kernel-assisted data-pipeline platform that de-anonymizes anonymous memory, extends the Arrow IPC protocol with references to in-memory files (SIPC), and centralizes deserialization (DeCache) to avoid duplication across DAGs; combined with an adaptive admission/eviction policy, Zerrow enables true zero-copy-like behavior for writer-side and input-output paths. Key contributions include the KernelZero de-anonymization module, the SIPC extension, the DeCache deserialization cache, the node sandbox architecture, and a resource-manager-driven eviction strategy. Preliminary evaluations show substantial gains: overall throughput improvements of 1.5–9.5x, a 2.8x throughput increase when multiple DAGs share inputs, and notable reductions in memory usage through resharing and dictionary-based sharing. In practice, this allows more DAG nodes to run in parallel on the same hardware and reduces latency and memory footprint for data pipelines built on Arrow in a FaaS lakehouse setting.

Abstract

Bauplan is a FaaS-based lakehouse specifically built for data pipelines: its execution engine uses Apache Arrow for data passing between the nodes in the DAG. While Arrow is known as the "zero copy format", in practice, limited Linux kernel support for shared memory makes it difficult to avoid copying entirely. In this work, we introduce several new techniques to eliminate nearly all copying from pipelines: in particular, we implement a new kernel module that performs de-anonymization, thus eliminating a copy of intermediate data. We conclude by sharing our preliminary evaluation on different workload types, as well as discussing our plan for future improvements.
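
To make the copy mentioned above concrete, here is a minimal sketch that uses only Python's standard library. It is not Zerrow, KernelZero, or Arrow code. The names `publish_with_copy` and `read_zero_copy` are hypothetical, and a temporary file stands in for an in-memory file. It assumes a writer that builds its output in private (anonymous) memory and must copy it into a shareable mapping, while the reader can view the mapped bytes without copying.

```python
import mmap
import tempfile


def publish_with_copy(payload: bytes):
    """Writer side: data built in private memory is copied into a
    file-backed mapping so that another process could map it."""
    f = tempfile.TemporaryFile()              # stand-in for an in-memory file
    f.truncate(len(payload))                  # size the file before mapping it
    shared = mmap.mmap(f.fileno(), len(payload))
    shared[:] = payload                       # the writer-side copy in question
    return f, shared


def read_zero_copy(shared: mmap.mmap) -> memoryview:
    """Reader side: a memoryview exposes the mapped bytes without copying."""
    return memoryview(shared)


payload = bytes(range(256)) * 16              # hypothetical intermediate output
f, shared = publish_with_copy(payload)
view = read_zero_copy(shared)

# tobytes() copies; it is used here only to check the contents.
same_bytes = view.tobytes() == payload

# A write through the mapping is visible through the view: same memory.
shared[0] = 255
view_sees_write = view[0] == 255

# Release the view before closing the mapping it points into.
view.release()
shared.close()
f.close()
```

In this sketch the read path is already copy-free, while the assignment `shared[:] = payload` is the remaining writer-side copy. The abstract's de-anonymization targets that kind of copy, at the kernel level rather than in user code.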

Paper Structure

This paper contains 25 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Communication: Degrees of Zero Copy
  • Figure 2: Latency with Copy Avoidance
  • Figure 3: Zerrow Architecture and Write Path
  • Figure 4: Copy Avoidance. Throughput and swapping are shown with and without KernelZero for a single-node DAG.
  • Figure 5: Performance of DAGs with the same Inputs. The x-axis is the number of parallel executions; the y-axis is the throughput in (a) and the number of foreground swap-in events (in millions) in (b). Baseline crashes at $x > 20$ because of OOM.
  • ...and 5 more figures