PG-HIVE: Hybrid Incremental Schema Discovery for Property Graphs

Sofia Sideri, Georgia Troullinou, Elisjana Ymeralli, Vasilis Efthymiou, Dimitris Plexousakis, Haridimos Kondylakis

TL;DR

PG-HIVE discovers schemas for property graphs automatically, without relying on an explicit schema. It combines Word2Vec-based representations with Locality-Sensitive Hashing to cluster nodes and edges, merges the clusters into node and edge types, infers property constraints and datatypes, and estimates cardinalities, all in an incremental, batch-oriented fashion. The approach outperforms state-of-the-art baselines in accuracy (gains of up to 65% for nodes and 40% for edges) and in efficiency (up to 1.95x faster than SchemI), especially under noise and with incomplete labeling. It also serializes the discovered schema in XSD and PG-Schema formats and guarantees monotone, information-preserving updates as data evolves.

Abstract

Property graphs have rapidly become the de facto standard for representing and managing complex, interconnected data, powering applications across domains from knowledge graphs to social networks. Despite these advantages, their schema-free nature poses major challenges for integration, exploration, visualization, and efficient querying. To bridge this gap, we present PG-HIVE, a novel framework for automatic schema discovery in property graphs. PG-HIVE goes beyond existing approaches by uncovering latent node and edge types and by inferring property datatypes, constraints, and cardinalities, even in the absence of explicit labeling information. Leveraging a unique combination of Locality-Sensitive Hashing with property- and label-based clustering, PG-HIVE identifies structural similarities at scale. Moreover, it introduces incremental schema discovery, eliminating costly recomputation as new data arrives. Through extensive experimentation, we demonstrate that PG-HIVE consistently outperforms state-of-the-art solutions in both accuracy (by up to 65% for nodes and 40% for edges) and efficiency (up to 1.95x faster execution), unlocking the full potential of schema-aware property graph management.
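
To make the hashing idea concrete, here is a minimal sketch of locality-sensitive hashing applied to nodes. It is not the PG-HIVE pipeline: it swaps the Word2Vec-based representations for plain binary indicator vectors over property keys and uses random-hyperplane (sign) hashing. The function name `lsh_buckets`, its parameters, and the toy node data are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def lsh_buckets(nodes, num_planes=8, seed=0):
    """Group nodes whose property-key vectors share a hash signature.

    nodes: dict mapping a node id to the set of property keys it carries.
    (Hypothetical helper; the encoding and hash family are assumptions.)
    """
    keys = sorted({k for props in nodes.values() for k in props})
    rng = random.Random(seed)
    # One random hyperplane per signature bit, with Gaussian components.
    planes = [[rng.gauss(0.0, 1.0) for _ in keys] for _ in range(num_planes)]

    buckets = defaultdict(list)
    for node_id, props in nodes.items():
        # Binary indicator vector: 1.0 if the node has the key, else 0.0.
        vec = [1.0 if k in props else 0.0 for k in keys]
        # Each bit records which side of a hyperplane the vector falls on.
        signature = tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes
        )
        buckets[signature].append(node_id)
    return dict(buckets)

# Toy, unlabeled nodes described only by their property keys.
nodes = {
    "n1": {"name", "age"},
    "n2": {"name", "age"},
    "n3": {"title", "year", "venue"},
}
buckets = lsh_buckets(nodes)
```

Nodes with identical property sets always share a bucket, and nodes with similar sets collide with high probability, so candidate types can be formed without comparing every pair of nodes. More hyperplanes give finer buckets at the cost of splitting near-duplicates more often.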

Paper Structure

This paper contains 14 sections, 2 theorems, 4 equations, 8 figures, 2 tables, 2 algorithms.

Key Result

Lemma 1

Let $T_{N1}=(\mathcal{L}_1,\mathcal{K}_1)$ and $T_{N2}=(\mathcal{L}_2,\mathcal{K}_2)$, and let $T_{NM}=(\mathcal{L}_M,\mathcal{K}_M)=(\mathcal{L}_1\cup\mathcal{L}_2,\;\mathcal{K}_1\cup\mathcal{K}_2)$ be the merge of $T_{N1}$ and $T_{N2}$. Then $\mathcal{K}_i\subseteq \mathcal{K}_M$ and $\mathcal{L}_i\subseteq \mathcal{L}_M$ for $i\in\{1,2\}$.
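
The containment in the lemma can be checked on a toy instance. The sketch below models a node type as a pair of label and property-key sets and merges two types by set union, as the lemma describes. The class `NodeType`, the function `merge`, and the example labels and keys are hypothetical names chosen for illustration, not identifiers from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeType:
    """A node type as a pair (labels, property keys); hypothetical model."""
    labels: frozenset
    keys: frozenset

def merge(t1: NodeType, t2: NodeType) -> NodeType:
    # Merge by taking the union of the labels and the union of the keys.
    return NodeType(t1.labels | t2.labels, t1.keys | t2.keys)

# Toy types, invented for this example.
t1 = NodeType(frozenset({"Person"}), frozenset({"name", "age"}))
t2 = NodeType(frozenset({"Person", "Author"}), frozenset({"name", "affiliation"}))
tm = merge(t1, t2)

# Both inputs are contained in the merged type, so no label or key is lost.
for t in (t1, t2):
    assert t.labels <= tm.labels
    assert t.keys <= tm.keys
```

Because union never removes elements, repeating the merge as new batches arrive can only grow a type, which is the monotone, information-preserving behaviour the summary refers to.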

Figures (8)

  • Figure 1: Example Property Graph.
  • Figure 2: PG-HIVE process.
  • Figure 3: Statistical significance analysis of F1-scores across datasets for nodes (top) and edges (bottom). GMM does not produce edge types.
  • Figure 4: F1-scores across all noise levels (0% - 40%) and label availability levels (0%, 50%, and 100%).
  • Figure 5: Execution time until type discovery on each dataset across different noise percentages (0% - 40%).
  • ...and 3 more figures

Theorems & Definitions (16)

  • Definition 3.1: Property Graph [DBLP:journals/pacmmod/AnglesBD0GHLLMM23]
  • Definition 3.2: Node Type
  • Definition 3.3: Edge Type
  • Definition 3.4: Schema Graph
  • Example 1
  • Definition 3.5: Node Pattern
  • Definition 3.6: Edge Pattern
  • Example 2
  • Example 3
  • Example 4
  • ...and 6 more