Cardinality Estimation of Subgraph Matching: A Filtering-Sampling Approach

Wonseok Shin, Siwoo Song, Kunsoo Park, Wook-Shin Han

TL;DR

FaSTest tackles subgraph cardinality estimation with a filtering-sampling framework that drastically reduces the sample space while preserving all embeddings. It introduces three safety conditions (Triangle Safety, Four-Cycle Safety, Edge Bipartite Safety) and a Promising-First refinement to build a compact Candidate Space, which enables efficient uniform sampling of candidate trees. When tree sampling is insufficient, FaSTest applies a worst-case optimal stratified graph sampling strategy with a provable complexity bound of $O(AGM(q))$, achieving high accuracy on hard instances. Across diverse real-world datasets, FaSTest outperforms state-of-the-art sampling methods by up to two orders of magnitude in accuracy and GNN-based approaches by up to three orders of magnitude, with reasonable indexing and memory cost, making large-scale subgraph cardinality estimation practical.
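
The summary does not define $AGM(q)$. Assuming it denotes the standard Atserias-Grohe-Marx bound on the output size of a join query, a small worked instance may help. The symbols $R_e$, $x_e$, and $m$ below are introduced here for illustration and do not come from the summary.

$$
AGM(q) \;=\; \min_{x} \prod_{e \in E(q)} |R_e|^{x_e}
\quad \text{subject to} \quad
\sum_{e \ni u} x_e \ge 1 \ \ \forall u \in V(q), \qquad x_e \ge 0 .
$$

For a triangle query whose three edges each range over a relation of size $m$, the fractional edge cover $x_e = \tfrac{1}{2}$ for every edge is feasible and optimal, so

$$
AGM(q) \;=\; m^{1/2} \cdot m^{1/2} \cdot m^{1/2} \;=\; m^{3/2},
$$

which is smaller than the $m^{2}$ obtained by covering the triangle with two whole edges.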

Abstract

Subgraph counting is a fundamental problem in understanding and analyzing graph structured data, yet computationally challenging. This calls for an accurate and efficient algorithm for Subgraph Cardinality Estimation, which is to estimate the number of all isomorphic embeddings of a query graph in a data graph. We present FaSTest, a novel algorithm that combines (1) a powerful filtering technique to significantly reduce the sample space, (2) an adaptive tree sampling algorithm for accurate and efficient estimation, and (3) a worst-case optimal stratified graph sampling algorithm for difficult instances. Extensive experiments on real-world datasets show that FaSTest outperforms state-of-the-art sampling-based methods by up to two orders of magnitude and GNN-based methods by up to three orders of magnitude in terms of accuracy.
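
To make the filter-then-sample idea concrete, here is a minimal sketch of a sampling-based cardinality estimator. It is not FaSTest: the filter is a plain label-and-degree check rather than the paper's safety conditions, and the sampler is a basic random descent with inverse-probability weighting rather than candidate tree or stratified graph sampling. All function names and the graph encoding (adjacency dictionaries of sets) are assumptions made for this illustration. Embeddings are counted as injective, edge-preserving mappings of query vertices to data vertices.

```python
import random
from itertools import permutations


def filter_candidates(q_adj, q_label, g_adj, g_label):
    """Keep, per query vertex, data vertices with the same label and enough neighbors."""
    return {
        u: [v for v in g_adj
            if g_label[v] == q_label[u] and len(g_adj[v]) >= len(q_adj[u])]
        for u in q_adj
    }


def sample_once(order, q_adj, g_adj, cands, rng):
    """One random descent. Returns the inverse sampling probability, or 0 on a dead end."""
    mapping, weight = {}, 1
    for u in order:
        mapped_nbrs = [mapping[w] for w in q_adj[u] if w in mapping]
        used = set(mapping.values())
        # Valid extensions: unused candidates adjacent to every already-mapped neighbor.
        choices = [v for v in cands[u]
                   if v not in used and all(v in g_adj[x] for x in mapped_nbrs)]
        if not choices:
            return 0
        weight *= len(choices)
        mapping[u] = rng.choice(choices)
    return weight


def estimate_embeddings(q_adj, q_label, g_adj, g_label, n_samples=1000, seed=0):
    """Average the per-sample weights; the mean is an unbiased estimate of the count."""
    rng = random.Random(seed)
    cands = filter_candidates(q_adj, q_label, g_adj, g_label)
    order = list(q_adj)
    total = sum(sample_once(order, q_adj, g_adj, cands, rng) for _ in range(n_samples))
    return total / n_samples


def exact_embeddings(q_adj, q_label, g_adj, g_label):
    """Brute-force count, only feasible for tiny graphs; used as a reference."""
    qs = list(q_adj)
    count = 0
    for image in permutations(g_adj, len(qs)):
        m = dict(zip(qs, image))
        labels_ok = all(g_label[m[u]] == q_label[u] for u in qs)
        edges_ok = all(m[w] in g_adj[m[u]] for u in qs for w in q_adj[u])
        count += labels_ok and edges_ok
    return count


# Triangle query on the complete graph K4, all vertices labeled "A".
g_adj = {v: {w for w in range(4) if w != v} for v in range(4)}
g_label = {v: "A" for v in g_adj}
q_adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
q_label = {u: "A" for u in q_adj}

print(estimate_embeddings(q_adj, q_label, g_adj, g_label),
      exact_embeddings(q_adj, q_label, g_adj, g_label))  # 24.0 24
```

On K4 every descent has 4, then 3, then 2 choices, so each sample has weight 24 and the estimate is exact. On less regular graphs the weights vary and many descents can hit dead ends, which is why shrinking the candidate sets before sampling matters for accuracy.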

Paper Structure

This paper contains 64 sections, 16 theorems, 28 equations, 10 figures, 5 tables, 6 algorithms.

Key Result

Theorem 4.11

With the stated stopping criteria, the various safety conditions lead to the following time complexity bounds for filtering.

Figures (10)

  • Figure 1: Query graphs and their numbers of embeddings in the WordNet dataset ($\approx$ 80K vertices). N, V, and A represent Noun, Verb, and Adjective, respectively.
  • Figure 2: Data graph $G$, query graph $q$, and Candidate Space
  • Figure 3: Example of Triangle Safety
  • Figure 6: New Running Example - Candidate Tree Sampling
  • Figure 7: Example of Stratified Sampling
  • ...and 5 more figures

Theorems & Definitions (31)

  • Definition 4.1: Candidate Space
  • Example 4.2
  • Definition 4.3
  • Example 4.4
  • Definition 4.5
  • Definition 4.6
  • Example 4.7
  • Definition 4.8
  • Example 4.9
  • ...and 21 more