Discovering Graph Generating Dependencies for Property Graph Profiling

Larissa C. Shimomura, Nikolay Yakovets, George Fletcher

TL;DR

This work introduces GGDMiner, a framework for automatic discovery of approximate Graph Generating Dependencies (GGDs) in property graphs to support data profiling. GGDs relate source and target graph patterns with differential constraints, and the authors formalize support, confidence, coverage, and a decision boundary to guide discovery, including extension GGDs where the target extends the source. A novel Answer Graph representation enables memory-efficient matching and confidence computation, while lattice-based Candidate Generation and a greedy Candidate Index maximize graph coverage with reduced redundancy. Experiments on real and synthetic datasets show high graph coverage and favorable performance, with notable gains from the Answer Graph and results competitive with AMIE+. The work provides a practical baseline for graph schema discovery and data profiling, with promising scalability and avenues for parallelization.
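
To make the idea of greedily maximizing graph coverage concrete, the following is a minimal sketch of the standard greedy maximum-coverage heuristic. It is not the paper's Candidate Index: the function names, the representation of a candidate as the set of graph element identifiers its matches touch, and the `min_gain` cutoff are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]],
                  graph_elements: Set[str],
                  min_gain: int = 1) -> List[str]:
    """Pick candidates one at a time, always taking the one that covers
    the most not-yet-covered graph elements (hypothetical helper)."""
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(candidates)
    while remaining and covered != graph_elements:
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(sorted(remaining),
                   key=lambda name: len((remaining[name] & graph_elements) - covered))
        gain = len((remaining[best] & graph_elements) - covered)
        if gain < min_gain:
            # Every remaining candidate is redundant with what is selected.
            break
        selected.append(best)
        covered |= remaining.pop(best) & graph_elements
    return selected


def coverage(selected: List[str],
             candidates: Dict[str, Set[str]],
             graph_elements: Set[str]) -> float:
    """Fraction of graph elements touched by the selected candidates."""
    covered: Set[str] = set()
    for name in selected:
        covered |= candidates[name] & graph_elements
    return len(covered) / len(graph_elements)


# Toy input: node/edge identifiers matched by three hypothetical patterns.
toy_candidates = {
    "p1": {"a", "b", "c"},
    "p2": {"b", "c"},
    "p3": {"d"},
}
toy_graph = {"a", "b", "c", "d", "e"}
chosen = greedy_select(toy_candidates, toy_graph)
```

In this toy input, `p2` is skipped because everything it matches is already covered by `p1`, which is the sense in which a greedy selection reduces redundancy among the reported dependencies.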

Abstract

With the increasing use of graph-structured data, there is also increasing interest in investigating graph data dependencies and their applications, e.g., in graph data profiling. Graph Generating Dependencies (GGDs) are a class of dependencies for property graphs that can express the relation between different graph patterns and constraints based on their attribute similarities. The rich syntax and semantics of GGDs make them a good candidate for graph data profiling. Nonetheless, GGDs are difficult to define manually, especially when no data experts are available. In this paper, we propose GGDMiner, a framework for automatically discovering approximate GGDs from graph data, with the intention of profiling graph data through GGDs for the user. GGDMiner has three main steps: (1) pre-processing, (2) candidate generation, and (3) GGD extraction. To optimize memory consumption and execution time, GGDMiner uses a factorized representation of each discovered graph pattern, called Answer Graph. Our results show that the discovered set of GGDs can give an overview of the input graph, covering both schema-level information and correlations between the graph patterns and attributes.

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Example GGD
  • Figure 2: Example Graph Labels and Attributes and Important attributes selected
  • Figure 3: Similarity Clustering Index
  • Figure 4: Answer Graph Example
  • Figure 5: Level 1 of the lattice after vertical and horizontal expansion
  • ...and 3 more figures

Theorems & Definitions (3)

  • Example 1
  • Example 2
  • Example 3