Should I Hide My Duck in the Lake?

Jonas Dann; Gustavo Alonso

TL;DR

This work proposes a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of querying raw files.

Abstract

Data lakes spend a significant fraction of query execution time scanning data from remote storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we propose a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of querying raw files. Our experimental estimates with DuckDB suggest that, by operating directly on pre-filtered data as delivered by a SmartNIC, significantly smaller CPUs can still match the query throughput of traditional setups.
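
To put the 46% decoding share in perspective, the following is a back-of-the-envelope sketch, not a result from the paper. It assumes an idealized case in which all decoding work leaves the CPU's critical path and nothing else changes (no extra transfer or coordination cost). Under that assumption, an Amdahl-style bound gives the best-case speedup and the fraction of CPU work that remains. The function names are hypothetical and chosen only for illustration.

```python
def offload_speedup_bound(offloaded_fraction: float) -> float:
    """Best-case speedup if a fraction of runtime is removed entirely.

    Assumption (not from the paper): the offloaded work costs the CPU
    nothing afterwards and the rest of the query is unchanged.
    """
    if not 0.0 <= offloaded_fraction < 1.0:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


def remaining_cpu_fraction(offloaded_fraction: float) -> float:
    """Fraction of the original CPU work left after the offload."""
    if not 0.0 <= offloaded_fraction <= 1.0:
        raise ValueError("offloaded_fraction must be in [0, 1]")
    return 1.0 - offloaded_fraction


# Decoding share reported in the abstract for TPC-H on Parquet.
DECODE_SHARE = 0.46

if __name__ == "__main__":
    print(f"best-case speedup: {offload_speedup_bound(DECODE_SHARE):.2f}x")
    print(f"remaining CPU work: {remaining_cpu_fraction(DECODE_SHARE):.0%}")
```

Under these assumptions the bound is roughly 1.85x, or equivalently about 54% of the original CPU work. Pushed-down operators would shift this further, but by how much depends on the query and is not something this sketch can estimate.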

Paper Structure (3 sections, 4 figures)

Introduction
Query Performance on Raw Data
Hiding Queries in the Datapath

Figures (4)

Figure 1: DuckDB TPC-H throughput benchmark for Parquet-resident data, pre-loaded tables, and pre-filtered tables.
Figure 2: TPC-H per-query breakdown (scale factor 30).
Figure 3: TPC-H CSV and JSON throughput & effects of Parquet input ordering (scale factors 10 & 30, respectively).
Figure 4: Data processing SmartNIC architecture.
