Scalable Community Search with Accuracy Guarantee on Attributed Graphs

Yuxiang Wang; Shuzhan Ye; Xiaoliang Xu; Yuxia Geng; Zhenghe Zhao; Xiangyu Ke; Tianxing Wu


TL;DR

This work addresses the scalable Community Search over Attributed Graphs (CS-AG) problem, which seeks a connected $k$-core containing a query node while optimizing a $q$-centric attribute distance that combines textual and numerical attributes. It proves NP-hardness and delivers two complementary solutions: an exact baseline with three pruning strategies, and an index-free sampling-estimation method that provides a runtime confidence-interval guarantee on the attribute cohesiveness via Bag of Little Bootstraps and Hoeffding-based sampling. The approximate method can terminate early once a user-specified relative error bound is met, and it extends to heterogeneous graphs, size-bounded settings, and alternative community models. Extensive experiments on ten real-world datasets show substantial speedups (often orders of magnitude) and reliable accuracy, validating the practicality of the approach for large-scale attributed graphs.
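The two statistical ingredients named above, a Hoeffding-style sample size and a stopping test on the relative error of a confidence interval, can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes the quantity being averaged is bounded in an interval of known width (for example an attribute distance normalized to $[0, 1]$), and the function names `hoeffding_sample_size`, `relative_error`, and `should_stop` are hypothetical. The confidence interval itself would come from a separate estimator such as a bootstrap, which is not shown.

```python
import math


def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Smallest n such that the mean of n i.i.d. samples, each lying in an
    interval of width value_range, deviates from the true mean by at least
    epsilon with probability at most delta (two-sided Hoeffding bound)."""
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    # Solve 2 * exp(-2 * n * epsilon^2 / value_range^2) <= delta for n.
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def relative_error(lower, upper, estimate):
    """Half-width of the interval [lower, upper] relative to the point estimate
    (an assumed definition; the paper may define relative error differently)."""
    return (upper - lower) / (2.0 * abs(estimate))


def should_stop(lower, upper, estimate, error_bound):
    """Early-termination test: stop once the relative error is within the bound."""
    if estimate == 0:
        return False  # relative error is undefined for a zero estimate
    return relative_error(lower, upper, estimate) <= error_bound


if __name__ == "__main__":
    # Samples needed for an absolute error of 0.1 at 95% confidence.
    print(hoeffding_sample_size(epsilon=0.1, delta=0.05))
    # An interval [0.45, 0.55] around 0.5 has a relative error of about 10%.
    print(should_stop(0.45, 0.55, 0.5, error_bound=0.2))
```

In an incremental scheme, a loop would draw more samples, recompute the interval, and call `should_stop` after each round, so that easy queries finish with few samples.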

Abstract

Given an attributed graph $G$ and a query node $q$, Community Search over Attributed Graphs (CS-AG) aims to find a structure- and attribute-cohesive subgraph of $G$ that contains $q$. Although CS-AG has been widely studied, existing methods still face three challenges. (1) Exact methods based on graph traversal are time-consuming, especially on large graphs. Tailored indices can improve efficiency, but they introduce non-negligible storage and maintenance overhead. (2) Approximate methods with a loose approximation ratio provide only a coarse-grained evaluation of a community's quality, rather than a reliable evaluation with an accuracy guarantee at runtime. (3) Attribute cohesiveness metrics often ignore the important correlation with the query node $q$. We formally define our CS-AG problem atop a $q$-centric attribute cohesiveness metric that considers both textual and numerical attributes, for the $k$-core model on homogeneous graphs. We show that the problem is NP-hard. To solve it, we first propose an exact baseline with three pruning strategies. We then propose an index-free sampling-estimation-based method that quickly returns an approximate community with an accuracy guarantee in the form of a confidence interval. Once a result satisfying a user-desired error bound is reached, the method terminates early. We extend it to heterogeneous graphs, the $k$-truss model, and size-bounded CS. Comprehensive experimental studies on ten real-world datasets show its superiority: it is at least 1.54$\times$ (41.1$\times$ on average) faster in response time and achieves a reliable relative error of attribute cohesiveness, within a user-specified error bound.
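For readers unfamiliar with the structural model, the following sketch shows one standard way to obtain the maximal connected $k$-core containing $q$: repeatedly peel nodes whose degree falls below $k$, then keep the connected component of $q$ among the survivors. It covers only structure cohesiveness, not the attribute metric or the paper's pruning and sampling steps. The adjacency-dictionary representation and the function name `connected_k_core` are assumptions made for illustration.

```python
from collections import deque


def connected_k_core(adj, q, k):
    """Return the node set of the maximal connected k-core containing q,
    or an empty set if q does not belong to any k-core.

    adj maps each node to the set of its neighbours (undirected graph,
    so the mapping is assumed to be symmetric).
    """
    # Current degree of every node in the shrinking subgraph.
    degree = {v: len(nbrs) for v, nbrs in adj.items()}

    # Peeling: remove nodes with degree < k until none remain.
    removed = {v for v, d in degree.items() if d < k}
    queue = deque(removed)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u in removed:
                continue
            degree[u] -= 1
            if degree[u] < k:
                removed.add(u)
                queue.append(u)

    if q not in adj or q in removed:
        return set()

    # Breadth-first search from q restricted to the surviving nodes.
    community = {q}
    frontier = deque([q])
    while frontier:
        v = frontier.popleft()
        for u in adj[v]:
            if u not in removed and u not in community:
                community.add(u)
                frontier.append(u)
    return community


if __name__ == "__main__":
    # Toy graph: a triangle a-b-c with a pendant node d, plus a separate triangle x-y-z.
    toy = {
        "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
        "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
    }
    print(sorted(connected_k_core(toy, "a", 2)))
```

Peeling visits each edge a constant number of times, so this step is linear in the size of the graph; the hard part of the problem lies in choosing an attribute-cohesive subgraph inside this structure.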


Paper Structure (28 sections, 13 theorems, 16 equations, 10 figures, 6 tables, 1 algorithm)

This paper contains 28 sections, 13 theorems, 16 equations, 10 figures, 6 tables, 1 algorithm.

Introduction
Preliminaries and Problems
  Preliminaries
  Problem Definition
  Hardness Analysis
Exact Baseline
  Find the Maximal Connected $k$-core
  Enumeration with Pruning Strategies
  Complexity Analysis
Sampling-Estimation Solution
  Sampling-based Maximal $\tilde{H}_k$ Finding
  Estimation with Accuracy Guarantee
  Error-based Incremental Sampling
  Complexity Analysis
Extensions
...and 13 more sections

Key Result

Theorem 1

The $\tau$ R-MWCS problem is NP-hard.

Figures (10)

Figure 1: An example of CS: (a) A snapshot of IMDB with attributes at the bottom. (b)-(e) Different results of four methods.
Figure 2: An example of $k$-core and connected $k$-core
Figure 3: A search tree for the $\tilde{H}_2$ in Figure 2(c)
Figure 4: The pipeline of our sampling-estimation method
Figure 5: Effectiveness (a)-(b) and efficiency (c)-(d) results over homogeneous graphs
...and 5 more figures

Theorems & Definitions (23)

Definition 1
Definition 2
Definition 3
Definition 4
Theorem 1
Theorem 2
Lemma 1
Theorem 3
Example 1
Theorem 4
...and 13 more
