Efficient Partition-based Approaches for Diversified Top-k Subgraph Matching
Liuyi Chen, Yuchen Hu, Zhengyi Yang, Xu Zhou, Wenjie Zhang, Kenli Li
TL;DR
The paper tackles DT$k$SM, the problem of selecting $k$ isomorphic subgraph matches whose pairwise topological distances in a data graph are maximal. It introduces the Partition-based Distance Diversity (PDD) framework, which partitions the graph, computes inter-partition distances via a Partition Adjacency Graph (PAG), and selects dispersed partitions to drive diverse matching, using parallel HySM and optional inter-partition completion. Two optimizations, embedding-driven partition filtering and densest-based partition selection over a Partition Distance Graph (PDG), improve efficiency and global dispersion, with theoretical and empirical support. Experiments on 12 datasets show speedups of up to four orders of magnitude over baselines and strong diversity: about 95% of results reach 80% of the optimal distance diversity with 100% coverage diversity. These results suggest practical value for scalable, globally diverse subgraph matching in domains such as biology, finance, and social networks.
Abstract
Subgraph matching is a core task in graph analytics, widely used in domains such as biology, finance, and social networks. Existing top-k diversified methods typically focus on maximizing vertex coverage, but often return results in the same region, limiting topological diversity. We propose the Distance-Diversified Top-k Subgraph Matching (DTkSM) problem, which selects k isomorphic matches with maximal pairwise topological distances to better capture global graph structure. To address its computational challenges, we introduce the Partition-based Distance Diversity (PDD) framework, which partitions the graph and retrieves diverse matches from distant regions. To enhance efficiency, we develop two optimizations: embedding-driven partition filtering and densest-based partition selection over a Partition Adjacency Graph. Experiments on 12 real-world datasets show that our approach achieves up to four orders of magnitude speedup over baselines, with 95% of results reaching 80% of optimal distance diversity and 100% coverage diversity.
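
To make the objective concrete, the following minimal Python sketch picks k mutually distant matches from a list of already enumerated matches. It is not the PDD framework: it uses a plain farthest-point greedy heuristic with no partitioning, and it assumes that the distance between two matches is the smallest hop count between any pair of their vertices. The function names (`bfs_distances`, `match_distance`, `greedy_diverse_top_k`) and the adjacency-dictionary input format are hypothetical choices made for illustration only.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def match_distance(adj, match_a, match_b):
    """Assumed match distance: minimum hop count between any two vertices.

    Returns infinity when the two matches lie in different components.
    """
    best = float("inf")
    for u in match_a:
        dist = bfs_distances(adj, u)
        for v in match_b:
            best = min(best, dist.get(v, float("inf")))
    return best


def greedy_diverse_top_k(adj, matches, k):
    """Farthest-point greedy selection (illustrative, not the paper's method).

    Starts from the first match and repeatedly adds the match whose
    distance to its nearest already-chosen match is largest.
    """
    if not matches or k <= 0:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(matches)):
        best_idx, best_score = None, -1
        for i in range(len(matches)):
            if i in chosen:
                continue
            score = min(match_distance(adj, matches[i], matches[j])
                        for j in chosen)
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return [matches[i] for i in chosen]
```

Because this sketch runs a breadth-first search for every vertex of every candidate match, it is only practical on small graphs, which is the kind of cost that the partition-based selection described in the abstract is meant to avoid.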
