Repairing Property Graphs under PG-Constraints
Christopher Spinrath, Angela Bonifati, Rachid Echahed
TL;DR
The paper studies how to repair property graphs under PG-Constraints. It identifies RGPC, a practical subset of PG-Constraints that supports recursion while still permitting automata-based reasoning about errors. It proposes a holistic repair pipeline that deletes nodes and edges and, optionally, labels, and it compares three repair strategies: ILP, a naive greedy algorithm, and an LP-guided greedy algorithm. Allowing label deletions can reduce the total number of deletions by up to 59%, and the LP-guided greedy algorithm often matches ILP quality while cutting runtime by up to 97%. Optional neighborhood-based refinements further balance repair quality against performance. The approach is validated on real-world datasets, including an investigative journalism graph, where it yields gains in both repair quality and runtime. Overall, the work provides a rigorous framework for constraint-aware graph repair, with scalable algorithms for ensuring data integrity in property graphs.
Abstract
Recent standardization efforts for graph databases have led to standard query languages like GQL and SQL/PGQ, and to constraint languages like Property Graph Constraints (PG-Constraints). In this paper, we embark on the study of repairing property graphs under PG-Constraints. We identify a significant subset of PG-Constraints that encodes denial constraints and includes recursion as a key feature, while still permitting automata-based structural analyses of errors. We present a comprehensive pipeline for repairing property graphs under these constraints, involving changes to the graph topology and leading to node, edge and, optionally, label deletions. We investigate three algorithmic strategies for the repair procedure: one based on Integer Linear Programming (ILP), a naive greedy algorithm, and an LP-guided greedy algorithm. Our experiments on various real-world datasets reveal that repairing with label deletions can achieve a 59% reduction in deletions compared to repairing with node and edge deletions only. Moreover, the LP-guided greedy algorithm offers a runtime advantage of up to 97% compared to the ILP strategy, while achieving the same repair quality.
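To make the idea of a naive greedy repair concrete, the following is a minimal sketch and not the paper's algorithm. It assumes, purely for illustration, that each constraint violation has already been detected and is given as the set of graph elements (nodes, edges, or labels) whose deletion would resolve it, so that a repair is a set of deletions that hits every violation. The function name `greedy_repair` and the element identifiers are hypothetical, and the sketch ignores that deleting a node in a real property graph also removes its incident edges.

```python
from collections import Counter


def greedy_repair(violations):
    """Pick deletions greedily until every violation contains a deleted element.

    `violations` is an iterable of sets of element identifiers (an assumed
    encoding, not taken from the paper). Empty sets are skipped because no
    deletion can resolve them.
    """
    remaining = [frozenset(v) for v in violations if v]
    deletions = set()
    while remaining:
        # Count in how many unresolved violations each element occurs.
        counts = Counter(e for v in remaining for e in v)
        # Choose the most frequent element; ties are broken by name so the
        # result is deterministic.
        best = min(counts, key=lambda e: (-counts[e], str(e)))
        deletions.add(best)
        # Drop every violation that the chosen deletion resolves.
        remaining = [v for v in remaining if best not in v]
    return deletions


# Hypothetical violations: three share node "n2", one is unrelated.
violations = [
    {"n1", "e1", "n2"},
    {"n2", "e2", "n3"},
    {"n2", "e3", "n4"},
    {"n5", "e4", "n6"},
]
repair = greedy_repair(violations)
print(sorted(repair))
```

Under this encoding, an ILP strategy would instead minimize the number of deleted elements exactly, subject to one covering constraint per violation, and an LP-guided variant could use the fractional solution of the relaxed program to rank candidates in place of the raw counts used here. Both readings are assumptions about the general technique rather than a description of the authors' implementation.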
