GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes

Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu, Jingren Zhou

TL;DR

GraphAr is a specialized storage scheme that extends existing data lakes for efficient graph data management. It outperforms conventional Parquet and Acero-based methods, with average speedups of 4452× for neighbor retrieval, 14.8× for label filtering, and 29.5× for end-to-end workloads.

Abstract

Data lakes, increasingly adopted for their ability to store and analyze diverse types of data, commonly use columnar storage formats like Parquet and ORC for handling relational tables. However, these traditional setups fall short when it comes to efficiently managing graph data, particularly those conforming to the Labeled Property Graph (LPG) model. To address this gap, this paper introduces GraphAr, a specialized storage scheme designed to enhance existing data lakes for efficient graph data management. Leveraging the strengths of Parquet, GraphAr captures LPG semantics precisely and facilitates graph-specific operations such as neighbor retrieval and label filtering. Through innovative data organization, encoding, and decoding techniques, GraphAr dramatically improves performance. Our evaluations reveal that GraphAr outperforms conventional Parquet and Acero-based methods, achieving an average speedup of 4452x for neighbor retrieval, 14.8x for label filtering, and 29.5x for end-to-end workloads. These findings highlight GraphAr's potential to extend the utility of data lakes by enabling efficient graph data management.

Paper Structure (27 sections, 1 theorem, 4 equations, 10 figures, 3 tables)

This paper contains 27 sections, 1 theorem, 4 equations, 10 figures, 3 tables.

Key Result

Theorem 1

Consider $k$ interval lists $P_0,P_1,\dots,P_{k-1}$, where the vertices in $[P_i[j],P_i[j+1])$ share the same value for the $i$-th label. If an interval $[s,e)$ is not broken by any position, i.e., the vertices within the interval $[s,e)$ have the same labels, i.e.,
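
The theorem statement is cut off in this listing, so the following is only a minimal sketch of its premise, not of the paper's algorithm. It assumes each interval list is a sorted Python list of boundary positions, and that an interval $[s,e)$ is "broken" when some boundary $p$ with $s < p < e$ appears in any list. The function name `is_unbroken` and the sample lists are hypothetical.

```python
from bisect import bisect_right

def is_unbroken(interval_lists, s, e):
    """Return True if no boundary p with s < p < e occurs in any list.

    interval_lists: k sorted lists of boundary positions (assumed layout);
    vertices in [P_i[j], P_i[j+1]) share the same value for label i.
    """
    for positions in interval_lists:
        # Index of the first boundary strictly greater than s.
        idx = bisect_right(positions, s)
        # A boundary strictly inside (s, e) splits the interval for this label.
        if idx < len(positions) and positions[idx] < e:
            return False
    return True

# Hypothetical example with k = 2 labels over 10 vertices.
P0 = [0, 4, 10]   # label 0 changes value at vertex 4
P1 = [0, 6, 10]   # label 1 changes value at vertex 6

unbroken_a = is_unbroken([P0, P1], 4, 6)  # no boundary strictly inside (4, 6)
broken_b = is_unbroken([P0, P1], 2, 6)    # boundary 4 from P0 lies inside (2, 6)
```

Under these assumptions, all vertices in an unbroken interval carry identical values for every label, so a label filter could be evaluated once for the whole interval rather than once per vertex.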

Figures (10)

  • Figure 1: A graph-related query within the data lake.
  • Figure 2: The internal structure of a Parquet file for a logical table with $C$ columns and $R$ row groups.
  • Figure 3: An example of querying LPGs on tabular formats.
  • Figure 4: The metadata and data layout for the example graph in GraphAr.
  • Figure 5: An example of PAC and its usage.
  • ...and 5 more figures

Theorems & Definitions (5)

  • Definition 1: Neighbor Retrieval
  • Definition 2: Page-aligned collections (PAC)
  • Definition 3: Simple Condition Filtering
  • Definition 4: Complex Condition Filtering
  • Theorem 1