Should I Hide My Duck in the Lake?

Jonas Dann, Gustavo Alonso

TL;DR

This work proposes a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of querying raw files.

Abstract

Data lakes spend a significant fraction of query execution time on scanning data from remote storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we propose a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of querying raw files. Our experimental estimates with DuckDB suggest that, by operating directly on pre-filtered data as delivered by a SmartNIC, significantly smaller CPUs can still match the query throughput of traditional setups.

Paper Structure (3 sections, 4 figures)

Figures (4)

  • Figure 1: DuckDB TPC-H throughput benchmark for Parquet-resident data, pre-loaded tables, and pre-filtered tables.
  • Figure 2: TPC-H per-query breakdown (scale factor 30).
  • Figure 3: TPC-H CSV and JSON throughput & effects of Parquet input ordering (scale factors 10 & 30, respectively).
  • Figure 4: Data processing SmartNIC architecture.