Do GPUs Really Need New Tabular File Formats?
Jigao Luo, Qi Chen, Carsten Binnig
TL;DR
Parquet dominates analytical storage, but its CPU-oriented defaults hinder GPU scans. The authors systematically measure how Parquet configuration affects GPU read performance and introduce a GPU-aware Parquet rewriter. They identify four actionable adjustments: increase the page count, enlarge the row group (RG) size to roughly 10M rows, enable encoding flexibility, and avoid unnecessary compression. Together these yield up to 125 GB/s of effective bandwidth when reading from SSDs via GPUDirect Storage. The work shows that substantial GPU acceleration is achievable within the Parquet format itself, which establishes a practical baseline of GPU-friendly configurations without resorting to new file formats. This gives GPU databases and related systems a concrete, hardware-aware path to optimizing Parquet.
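To make the adjustments above concrete, here is a minimal sketch of how three of them might be expressed as writer options for `pyarrow.parquet.write_table`. It is not the authors' rewriter. The helper names `gpu_friendly_write_options` and `rewrite_for_gpu` are hypothetical, and the 64 KiB page target is an assumed value chosen only to illustrate "more pages"; the summary does not state the page size the authors use. The encoding adjustment is left at the writer's defaults because the summary does not say which encodings are meant.

```python
def gpu_friendly_write_options(rows_per_group=10_000_000, page_bytes=64 * 1024):
    """Build keyword arguments for pyarrow.parquet.write_table.

    Hypothetical helper: the defaults are illustrative assumptions,
    except the ~10M-row row group size, which comes from the summary.
    """
    return {
        # Larger row groups: upper bound on rows per row group.
        "row_group_size": rows_per_group,
        # Smaller page target (bytes) so each column chunk splits into more pages.
        # 64 KiB is an assumed value, not one taken from the paper.
        "data_page_size": page_bytes,
        # Skip compression entirely.
        "compression": "NONE",
    }


def rewrite_for_gpu(src_path, dst_path, **overrides):
    """Re-write an existing Parquet file with the options above.

    Reads the whole file into memory, so this only suits files that fit in RAM.
    """
    import pyarrow.parquet as pq  # third-party dependency, imported lazily

    table = pq.read_table(src_path)
    options = {**gpu_friendly_write_options(), **overrides}
    pq.write_table(table, dst_path, **options)
    return options
```

A call such as `rewrite_for_gpu("in.parquet", "out.parquet")` would produce a file with the same data and a different physical layout. Whether that layout actually improves GPU scan bandwidth depends on the reader and hardware, so it should be measured rather than assumed.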
Abstract
Parquet is the de facto columnar file format in modern analytical systems, yet its configuration guidelines have largely been shaped by CPU-centric execution models. As GPU-accelerated data processing becomes increasingly prevalent, Parquet files generated with CPU-oriented defaults can severely underutilize GPU parallelism, turning GPU scans into a performance bottleneck. In this work, we systematically study how Parquet configurations affect GPU scan performance. We show that Parquet's poor GPU performance is not inherent to the format itself but rather a consequence of suboptimal configuration choices. By applying GPU-aware configurations, we increase effective read bandwidth up to 125 GB/s without modifying the Parquet specification.
