Zerrow: True Zero-Copy Arrow Pipelines in Bauplan
Yifan Dai, Jacopo Tagliabue, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Tyler R. Caraza-Harter
TL;DR
This work tackles the gap between Arrow's theoretical zero-copy property and the data movement overheads that Arrow-based DAGs incur in practice on Bauplan. It introduces Zerrow, a kernel-assisted data-pipeline platform that de-anonymizes anonymous memory, extends the Arrow IPC protocol with references to in-memory files (SIPC), and centralizes deserialization (DeCache) to avoid duplication across DAGs. Combined with an adaptive admission/eviction policy, these techniques bring writer-side and input-output paths close to true zero-copy behavior. Key contributions include the KernelZero de-anonymization module, the SIPC extension, the DeCache deserialization cache, the node sandbox architecture, and a resource-manager-driven eviction strategy. Preliminary evaluations show substantial gains: overall throughput improvements of 1.5–9.5x, a 2.8x throughput increase when multiple DAGs share inputs, and notable reductions in memory usage through resharing and dictionary-based sharing. In practice, the approach lets more DAG nodes run in parallel on the same hardware, reducing latency and memory footprint for data pipelines built on Arrow in a FaaS lakehouse setting.
Abstract
Bauplan is a FaaS-based lakehouse specifically built for data pipelines: its execution engine uses Apache Arrow to pass data between the nodes of a DAG. While Arrow is known as the "zero copy format", in practice, limited Linux kernel support for shared memory makes it difficult to avoid copying entirely. In this work, we introduce several new techniques to eliminate nearly all copying from pipelines: in particular, we implement a new kernel module that performs de-anonymization, thus eliminating a copy of intermediate data. We conclude by sharing our preliminary evaluation on different workload types and discussing our plans for future improvements.
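To make the copying problem concrete, the following is a minimal sketch in Python using only the standard library. It is not Zerrow code and does not use Arrow: `publish` and `consume` are hypothetical names, and `multiprocessing.shared_memory` stands in for whatever shared-memory mechanism a pipeline might use. The sketch assumes a producer that builds its output in ordinary private (anonymous) memory and must then copy it into a shareable segment before a consumer can read it without a further copy. That producer-side copy is the kind of overhead the de-anonymization described above aims to remove.

```python
import struct
from multiprocessing import shared_memory


def publish(values):
    """Producer side (hypothetical helper): returns (segment, payload size in bytes)."""
    # The output is first materialized in private, anonymous heap memory.
    private = bytearray(struct.pack(f"<{len(values)}q", *values))

    # To share it, the producer allocates a named shared-memory segment...
    shm = shared_memory.SharedMemory(create=True, size=len(private))

    # ...and copies the bytes across. This is the extra copy that remains
    # even when the reader side is zero-copy.
    shm.buf[:len(private)] = private
    return shm, len(private)


def consume(name, nbytes):
    """Consumer side (hypothetical helper): sums the int64 values in place."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # memoryview slicing and cast() create views over the same pages;
        # no bytes are copied here.
        with shm.buf[:nbytes] as raw, raw.cast("q") as view:
            return sum(view)
    finally:
        shm.close()
```

A caller would run `shm, n = publish([1, 2, 3, 4])`, pass `shm.name` and `n` to `consume`, and finally call `shm.close()` and `shm.unlink()` on the producer's handle. The payload size is passed explicitly because some platforms round the segment size up to a page boundary.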
