Data Formats in Analytical DBMSs: Performance Trade-offs and Future Directions

Chunwei Liu, Anna Pavlenko, Matteo Interlandi, Brandon Haynes

TL;DR

Each format has trade-offs that make it more or less suitable for use in a DBMS, and the paper identifies opportunities to more holistically co-design a unified in-memory and on-disk data representation.

Abstract

This paper evaluates the suitability of Apache Arrow, Parquet, and ORC as formats for subsumption in an analytical DBMS. We systematically identify and explore the high-level features that are important to support efficient querying in modern OLAP DBMSs and evaluate the ability of each format to support these features. We find that each format has trade-offs that make it more or less suitable for use as a format in a DBMS and identify opportunities to more holistically co-design a unified in-memory and on-disk data representation. Notably, for certain popular machine learning tasks, none of these formats performs optimally, highlighting significant opportunities for advancing format design. Our hope is that this study can serve as a guide for system developers designing and using these formats, and that it provides the community with directions to pursue for improving these common open formats.

Paper Structure

This paper contains 38 sections, 32 figures, 6 tables.

Figures (32)

  • Figure 1: Columnar format layout.
  • Figure 2: A Parquet row batch.
  • Figure 3: An ORC row batch.
  • Figure 4: Ratio of number of distinct values (#Distinct) to the number of rows (#Rows) in the CodecDB, Public BI and JOB datasets. The spikes for integer types near $D/N=1$ in the CodecDB and JOB datasets occur because of primary key columns, which contain no duplicate values.
  • Figure 5: Compression ratios on the CodecDB real-world datasets with ${\sim}18k$ columns. The figures show the effective compression ratio (CR) in the range $[0,1]$. The CDF lines do not always reach $1.0$ at $CR=1.0$ because of underperforming compression on some columns.
  • ...and 27 more figures