Do GPUs Really Need New Tabular File Formats?

Jigao Luo, Qi Chen, Carsten Binnig

TL;DR

Parquet dominates analytical storage, but its CPU-oriented defaults hinder GPU scans. The authors systematically measure how Parquet configuration affects GPU read performance and introduce a GPU-aware Parquet rewriter. They identify four actionable adjustments: increase the page count, enlarge the row group (RG) size to roughly 10M rows, enable encoding flexibility, and avoid unnecessary compression. These adjustments yield up to 125 GB/s of effective read bandwidth when reading from SSDs via GPUDirect Storage. Importantly, the work demonstrates substantial GPU acceleration within the Parquet format itself, establishing a practical baseline for GPU-friendly configurations without resorting to new file formats. This provides a concrete, hardware-aware pathway for optimizing Parquet in GPU databases and related systems.
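
To make the first two adjustments concrete, the following minimal Python sketch counts how many row groups and data pages a given file layout produces. It is a back-of-the-envelope model, not the authors' rewriter: the `ParquetLayout` class, the two example layouts, and the rounded row count of 1.8 billion for TPC-H SF300 lineitem are illustrative assumptions, and real page counts depend on the writer's page-size settings and on the data.

```python
from dataclasses import dataclass


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


@dataclass(frozen=True)
class ParquetLayout:
    """Hypothetical layout description (not a type from any Parquet library)."""
    total_rows: int          # rows in the table
    rows_per_row_group: int  # rows stored in each row group (RG)
    pages_per_chunk: int     # data pages in each column chunk
    num_columns: int         # columns in the table

    @property
    def row_groups(self) -> int:
        return ceil_div(self.total_rows, self.rows_per_row_group)

    @property
    def pages_per_row_group(self) -> int:
        # One column chunk per column per RG, each split into pages.
        return self.num_columns * self.pages_per_chunk

    @property
    def total_pages(self) -> int:
        return self.row_groups * self.pages_per_row_group


# Assumed, rounded row count for TPC-H SF300 lineitem (16 columns).
ROWS = 1_800_000_000
COLUMNS = 16

# Both layouts are illustrative values, not measured or recommended defaults.
small_rg = ParquetLayout(ROWS, rows_per_row_group=1_000_000,
                         pages_per_chunk=1, num_columns=COLUMNS)
gpu_aware = ParquetLayout(ROWS, rows_per_row_group=10_000_000,
                          pages_per_chunk=100, num_columns=COLUMNS)

for name, layout in (("small RG, few pages", small_rg),
                     ("large RG, many pages", gpu_aware)):
    print(f"{name}: {layout.row_groups} row groups, "
          f"{layout.pages_per_row_group} pages per row group, "
          f"{layout.total_pages} pages in total")
```

Under these assumed settings, the second layout has ten times fewer row groups to schedule but one hundred times more pages inside each row group. That is the kind of fine-grained work a GPU reader may be able to decode in parallel.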

Abstract

Parquet is the de facto columnar file format in modern analytical systems, yet its configuration guidelines have largely been shaped by CPU-centric execution models. As GPU-accelerated data processing becomes increasingly prevalent, Parquet files generated with CPU-oriented defaults can severely underutilize GPU parallelism, turning GPU scans into a performance bottleneck. In this work, we systematically study how Parquet configurations affect GPU scan performance. We show that Parquet's poor GPU performance is not inherent to the format itself but rather a consequence of suboptimal configuration choices. By applying GPU-aware configurations, we increase effective read bandwidth to as much as 125 GB/s without modifying the Parquet specification.

Paper Structure (4 sections, 3 figures)

This paper contains 4 sections, 3 figures.

Figures (3)

  • Figure 1: GPU Parquet scan on TPC-H SF300 lineitem with 4 SSDs: file configuration impact on effective read bandwidth.
  • Figure 2: GPU Parquet scan on TPC-H SF300 lineitem with one SSD: storage bus bandwidth of different file configurations. Left: varying page counts. Right: varying rows per RG.
  • Figure 3: GPU Parquet scan on TPC-H SF300 lineitem: file configuration and SSD scaling effects on effective bandwidth.