
ParquetDB: A Lightweight Python Parquet-Based Database

Logan Lang, Eduardo Hernandez, Kamal Choudhary, Aldo H. Romero

TL;DR

ParquetDB introduces a Python-based database framework that uses Apache Parquet and PyArrow to deliver efficient, columnar-storage–driven data management for complex and nested data. It eliminates heavy reliance on indexing by exploiting Parquet's metadata, predicate pushdown, and columnar encoding, while supporting schema evolution and data normalization. Through benchmarks against SQLite and MongoDB and a real-world Alexandria 3D Materials Database deployment, ParquetDB demonstrates strong update and read performance on large, nested datasets, with practical workflows for create, read, update, and delete operations. The work argues for ParquetDB as a scalable, portable alternative for data-intensive research and big-data pipelines, especially where rapid iteration and flexible data models are required.

Abstract

Traditional data storage formats and databases often introduce complexities and inefficiencies that hinder rapid iteration and adaptability. To address these challenges, we introduce ParquetDB, a Python-based database framework that leverages the Parquet file format's optimized columnar storage. ParquetDB offers efficient serialization and deserialization, native support for complex and nested data types, reduced dependency on indexing through predicate pushdown filtering, and enhanced portability due to its file-based storage system. Benchmarks show that ParquetDB outperforms traditional databases like SQLite and MongoDB in managing large volumes of data, especially when using data formats compatible with PyArrow. We validate ParquetDB's practical utility by applying it to the Alexandria 3D Materials Database, efficiently handling approximately 4.8 million complex and nested records. By addressing the inherent limitations of existing data storage systems and continuously evolving to meet future demands, ParquetDB has the potential to significantly streamline data management processes and accelerate research development in data-driven fields.
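The reduced dependency on indexing mentioned above rests on statistics stored with each chunk of rows, which let a reader skip chunks that cannot satisfy a filter. The following is a conceptual sketch in plain Python, not ParquetDB's or PyArrow's implementation; the names `RowGroup`, `build_row_groups`, and `filter_greater_than` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics a file format could keep as metadata."""
    values: list
    min_value: int
    max_value: int


def build_row_groups(values, row_group_size):
    """Split values into fixed-size chunks and record min/max for each chunk."""
    groups = []
    for start in range(0, len(values), row_group_size):
        chunk = values[start:start + row_group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def filter_greater_than(groups, threshold):
    """Return matching values and the number of row groups actually scanned."""
    matches = []
    scanned = 0
    for group in groups:
        if group.max_value <= threshold:
            # The statistics prove that no row in this group can match, so skip it.
            continue
        scanned += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, scanned


# Hypothetical data: 100 sorted integers stored in groups of 10 rows.
groups = build_row_groups(list(range(100)), row_group_size=10)
matches, scanned = filter_greater_than(groups, threshold=94)
print(matches, scanned)
```

In this sketch only the groups whose maximum exceeds the threshold are read. How much is skipped in practice depends on how the data is ordered: if the values were shuffled across groups, most min/max ranges would overlap the filter and few groups could be skipped.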

Paper Structure

This paper contains 60 sections, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Illustration of the serialization and deserialization process for numerical data in CSV and JSON formats. The inefficient ASCII encoding requires multiple bytes for each character, leading to larger file sizes and slower data conversion. Conversion steps, represented by hourglass icons, introduce additional latency as numerical values must be transformed into optimized binary formats for computational use. This process highlights the performance bottleneck in traditional plain text file formats.
  • Figure 2: Comparison of search efficiency in an unordered list versus a B-Tree index. The unordered list requires 18 steps to locate the value 89 through a sequential scan, while the B-Tree index reduces this to only 6 steps by leveraging its hierarchical and balanced structure. This illustrates the significant performance improvement that indexing provides, particularly in large datasets, by minimizing the number of operations needed for data retrieval.
  • Figure 3: Parquet File Format Overview. This diagram illustrates the structure of a Parquet file, including Row Groups, Columns, Pages, and the Footer. The metadata associated with each level provides essential details, such as schema, offsets, compression sizes, encryption, and statistical summaries. These metadata components enable efficient data storage, retrieval, and filtering, making Parquet an ideal choice for analytics and big data processing.
  • Figure 4: Comparison of Storage Layouts. Row-Based, Column-Based, and Hybrid-Based (Row Group Size = 2). Parquet files utilize a hybrid storage layout, balancing the strengths of row-based and column-based storage by grouping rows together for efficient read and write operations, making them suitable for analytic queries on large datasets.
  • Figure 5: Benchmark Create and Read Times for Different Databases. Create time is plotted on the left y-axis, read time on the right y-axis, and the number of rows on the x-axis. A log plot is shown in the inset.
  • ...and 6 more figures
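
The comparison in Figure 2 between a sequential scan and an indexed lookup can be sketched by counting comparison steps. This is a minimal illustration under stated assumptions: it uses binary search over a sorted list as a stand-in for a B-Tree, and the 20 values are made up, so the step counts will not match the figure exactly. The function names are hypothetical.

```python
def linear_search_steps(values, target):
    """Count the elements inspected by a sequential scan until target is found."""
    for steps, value in enumerate(values, start=1):
        if value == target:
            return steps
    return None  # target not present


def binary_search_steps(sorted_values, target):
    """Count the probes made by binary search over a sorted list."""
    lo, hi, steps = 0, len(sorted_values) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            return steps
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None  # target not present


# Hypothetical unordered data; 89 is the 18th element.
unordered = [42, 7, 63, 15, 91, 28, 54, 3, 76, 37,
             12, 68, 50, 21, 84, 9, 33, 89, 60, 46]

scan_steps = linear_search_steps(unordered, 89)
index_steps = binary_search_steps(sorted(unordered), 89)
print(scan_steps, index_steps)
```

The scan cost grows linearly with the position of the target, while the sorted lookup grows logarithmically with the number of elements. A real B-Tree adds a cost the sketch omits: the structure has to be built and maintained on every write.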