Representation Learning for Frequent Subgraph Mining

Rex Ying; Tianyu Fu; Andrew Wang; Jiaxuan You; Yu Wang; Jure Leskovec

TL;DR

Frequent subgraph mining is computationally hard because subgraph counting is NP-hard and the space of candidate motifs grows exponentially with motif size. SPMiner addresses this by using a graph neural network to learn an order embedding space that captures subgraph relations, then performing a monotonic walk in that space to identify high-frequency motifs efficiently. The approach pairs an encoder trained on synthetic data with a flexible search decoder (greedy, beam, or MCTS), and finds motifs of up to 20 nodes more accurately and faster than baselines. Empirical results show near-perfect recovery of the most frequent small motifs, robust identification of planted large motifs, and motifs with 10–100x higher frequency than those found by baselines on real-world datasets. The result is a general neural framework for motif discovery that transfers across domains such as biology, chemistry, and network science through synthetic pretraining and search in the embedding space.

Abstract

Identifying frequent subgraphs, also called network motifs, is crucial in analyzing and predicting properties of real-world networks. However, finding large commonly-occurring motifs remains a challenging problem, not only because of its NP-hard subroutine of subgraph counting, but also because of the exponential growth in the number of possible subgraph patterns. Here we present Subgraph Pattern Miner (SPMiner), a novel neural approach for approximately finding frequent subgraphs in a large target graph. SPMiner combines graph neural networks, an order embedding space, and an efficient search strategy to identify the subgraph patterns that appear most frequently in the target graph. SPMiner first decomposes the target graph into many overlapping subgraphs and then encodes each subgraph into an order embedding space. SPMiner then uses a monotonic walk in the order embedding space to identify frequent motifs. Compared to existing approaches and possible neural alternatives, SPMiner is more accurate, faster, and more scalable. For 5- and 6-node motifs, we show that SPMiner can almost perfectly identify the most frequent motifs while being 100x faster than exact enumeration methods. In addition, SPMiner can reliably identify frequent 10-node motifs, which is well beyond the size limit of exact enumeration approaches. Finally, we show that SPMiner can find large motifs of up to 20 nodes with 10-100x higher frequency than those found by current approximate methods.
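The counting idea behind the monotonic walk can be illustrated with a minimal sketch. It assumes that embeddings are plain lists of floats and that a motif is treated as contained in a neighborhood when its embedding is no larger in every coordinate. The function names, the `tol` parameter, and the toy vectors are hypothetical and are not the authors' implementation; a trained encoder would likely use a soft margin rather than this hard test.

```python
from typing import List, Sequence

Embedding = Sequence[float]


def is_dominated(motif: Embedding, neighborhood: Embedding, tol: float = 0.0) -> bool:
    """Return True if the motif embedding is <= the neighborhood embedding
    in every coordinate (up to a tolerance). Assumed containment test."""
    return all(m <= n + tol for m, n in zip(motif, neighborhood))


def estimate_frequency(motif: Embedding, neighborhoods: List[Embedding], tol: float = 0.0) -> int:
    """Count the neighborhood embeddings that dominate the motif embedding."""
    return sum(is_dominated(motif, n, tol) for n in neighborhoods)


def greedy_step(candidates: List[Embedding], neighborhoods: List[Embedding]) -> int:
    """Return the index of the candidate with the highest estimated frequency
    (ties go to the earliest candidate)."""
    counts = [estimate_frequency(c, neighborhoods) for c in candidates]
    return max(range(len(candidates)), key=counts.__getitem__)


# Toy 2-D embeddings (hypothetical values, for illustration only).
neighborhoods = [[0.9, 0.8], [0.5, 0.7], [0.2, 0.3], [0.6, 0.1]]
candidates = [[0.4, 0.6], [0.1, 0.1], [0.7, 0.7]]
best = greedy_step(candidates, neighborhoods)
```

In this sketch a greedy decoder would call `greedy_step` once per added node. A beam or MCTS decoder would keep several candidates per step instead of one.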

Paper Structure (20 sections, 4 theorems, 7 equations, 11 figures, 5 tables)

Introduction
Related Work
Proposed Method
Problem Setup
SPMiner Encoder $\phi$: Embedding Candidate Subgraphs
SPMiner Decoder: Motif Search Procedure
Runtime and Memory Analysis
SPMiner Expressive Power
Synthetic Graph Pretraining
Experiments
Experimental setup
Results
Limitations
Conclusion
Model Analysis
...and 5 more sections

Key Result

Proposition 1

Given an order embedding encoder GNN $\phi$, let a graph generation procedure be $\{G_0, G_1, \ldots, G_{k-1}\}$, where at any step $i$, $G_i$ is generated by adding 1 node to $G_{i-1}$. Then the sequence $\{\phi(G_0), \phi(G_1), \ldots, \phi(G_{k-1})\}$ is a monotonic walk in the order embedding space.
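One way to read the proposition: as the graph grows by one node per step, no coordinate of its embedding decreases. The sketch below checks that property on a sequence of embeddings. It assumes that "monotonic" means non-decreasing in every coordinate, which matches the "lower left" picture in Figure 1; the function name and the `tol` parameter are hypothetical.

```python
from typing import List, Sequence


def is_monotonic_walk(walk: List[Sequence[float]], tol: float = 0.0) -> bool:
    """Return True if every consecutive pair of embeddings is non-decreasing
    in every coordinate (up to a tolerance)."""
    return all(
        a <= b + tol
        for prev, nxt in zip(walk, walk[1:])
        for a, b in zip(prev, nxt)
    )


# Hypothetical embeddings of G_0, G_1, G_2 produced by some encoder.
growing = [[0.0, 0.0], [0.1, 0.2], [0.1, 0.5]]
broken = [[0.0, 0.0], [0.3, 0.2], [0.1, 0.5]]  # first coordinate drops at step 2
```

With these toy inputs, `is_monotonic_walk(growing)` is `True` and `is_monotonic_walk(broken)` is `False`.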

Figures (11)

Figure 1: SPMiner encoder (a) and SPMiner motif search procedure (b). (a) SPMiner decomposes a dataset into many node-anchored neighborhoods and maps each neighborhood to a point in the embedding space such that the order embedding property is preserved: if neighborhood $A$ is a subgraph of neighborhood $B$, then $A$ is embedded to the lower left of $B$. Here the yellow node-anchored neighborhood is a subgraph of both the blue and red neighborhoods, so it is embedded to the lower left of both of them. (b) SPMiner then starts with an empty graph and iteratively adds nodes and edges to it to find frequent motifs. SPMiner performs a monotonic walk in the order embedding space to identify a motif that is a subgraph of many neighborhoods. The walk in red represents the growth of a frequent motif. The key insight is that SPMiner can quickly count the occurrences of a given motif by checking the number of neighborhoods (points) that are embedded to the top right of it (denoted by the shaded region).
Figure 2: Distinction between node-anchored and graph-level subgraph frequency. Consider a hub node with degree 100 (left). We aim to determine the frequency of the star motif (right). Definition 1 with the center as anchor results in a count of $1$. In contrast, Definition 2 counts $\binom{100}{6}$ motif occurrences.
Figure 3: SPMiner learnable skip layer. (a) Initially, all skip connections are assigned equal weights. (b) After training, the learnable skip GNN has learned the skip connection configuration that best encodes subgraph relations. The architecture requires only $O(L^2)$ additional parameters.
Figure 4: Graph statistics of the synthetic and real-world graph datasets. Each point represents the statistics of one graph; the color of a point represents the dataset that the graph belongs to.
Figure 5: SPMiner vs. approximate frequent subgraph mining techniques. Among size-6 motifs, SPMiner identifies the top $K$ most frequent motifs more accurately than baselines (left). Furthermore, the top 10 motifs identified by SPMiner have higher frequency than those found by baselines, for both size-5 (middle) and size-6 (right) motifs. The blue dotted line represents the frequency of the ground-truth most frequent motifs.
...and 6 more figures
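The gap between the two counts in the Figure 2 caption can be reproduced with a short sketch. Definitions 1 and 2 are not reproduced in this summary, so the reading here is an assumption drawn from the caption: the node-anchored count is the number of nodes that can serve as the star's center, and the graph-level count is the number of distinct stars. The 6-leaf star is inferred from the binomial in the caption, and all function names are hypothetical.

```python
from math import comb
from typing import Dict, List


def star_graph(num_leaves: int) -> Dict[int, List[int]]:
    """Adjacency list of a star: node 0 is the hub, nodes 1..num_leaves are leaves."""
    adj: Dict[int, List[int]] = {0: list(range(1, num_leaves + 1))}
    for leaf in range(1, num_leaves + 1):
        adj[leaf] = [0]
    return adj


def anchored_star_count(adj: Dict[int, List[int]], k: int) -> int:
    """Number of nodes that can act as the center of a k-leaf star
    (assumed reading of the node-anchored definition)."""
    return sum(1 for nbrs in adj.values() if len(nbrs) >= k)


def graph_level_star_count(adj: Dict[int, List[int]], k: int) -> int:
    """Number of distinct k-leaf stars: choose k neighbors for each center
    (assumed reading of the graph-level definition; valid for k >= 2)."""
    return sum(comb(len(nbrs), k) for nbrs in adj.values())


hub = star_graph(100)
anchored = anchored_star_count(hub, 6)        # 1
graph_level = graph_level_star_count(hub, 6)  # comb(100, 6) = 1192052400
```

A single hub therefore contributes one anchored occurrence but more than a billion graph-level occurrences.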

Theorems & Definitions (8)

Definition 1
Definition 2
Definition 3
Proposition 1
Proposition 2
Proposition 3
Proposition 4
proof
