A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs

Yun Wang; Chrysanthi Kosyfaki; Sihem Amer-Yahia; Reynold Cheng

TL;DR

This work addresses hypothesis testing on large attributed graphs by formalizing node, edge, and path hypotheses and introducing a sampling-based framework. It proposes PHASE, a Path-Hypothesis-Aware SamplEr that biases sampling toward the elements specified by the hypothesis, and $\text{PHASE}_{\text{opt}}$, which uses non-backtracking walks and neighbor limiting to reduce runtime while preserving accuracy. The authors prove convergence of the hypothesis estimators and show in experiments on MovieLens, DBLP, and Yelp that $\text{PHASE}_{\text{opt}}$ can be at least 43x faster than PHASE with at most 4% accuracy loss, while delivering tighter p-values and confidence intervals. The approach makes hypothesis testing on large graphs more practical and supports longer paths and more complex hypotheses.
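To make the non-backtracking idea concrete, here is a minimal sketch of a non-backtracking random walk on an adjacency-list graph. It is not the authors' implementation: the function name, the toy graph, and the fallback rule for degree-1 nodes are all assumptions made for illustration, and it omits the hypothesis-aware weighting and the neighbor limiting.

```python
import random


def non_backtracking_walk(adj, start, steps, seed=0):
    """Walk up to `steps` hops from `start`, avoiding an immediate return
    to the previous node whenever another neighbor is available.

    `adj` is a hypothetical adjacency list: node -> list of neighbors.
    """
    rng = random.Random(seed)
    path = [start]
    prev, cur = None, start
    for _ in range(steps):
        neighbors = adj[cur]
        if not neighbors:
            break  # isolated node: the walk cannot continue
        candidates = [n for n in neighbors if n != prev]
        if not candidates:
            # Degree-1 node: going back is the only possible move (assumed rule).
            candidates = list(neighbors)
        nxt = rng.choice(candidates)
        path.append(nxt)
        prev, cur = cur, nxt
    return path


# Toy undirected graph in which every node has degree >= 2.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}
walk = non_backtracking_walk(adj, "a", 10)
```

Because every node in the toy graph has at least two neighbors, the walk never revisits the node it just left, which is the property that lets such walks spread over the graph with fewer wasted steps.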

Abstract

Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m-dimensional random walk that accounts for the paths specified in a hypothesis. We further optimize its time efficiency and propose $\text{PHASE}_{\text{opt}}$. Experiments on real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling in terms of accuracy and time efficiency.
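The sample-then-test workflow described in the abstract can be sketched for a simple node hypothesis as follows. This is an illustration under stated assumptions, not the paper's framework: the synthetic `nodes` dictionary, the `rating` attribute, uniform node sampling as the hypothesis-agnostic sampler, and the one-sample z-test are all choices made here for brevity.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def one_sample_z_test(values, mu0):
    """Two-sided z-test of H0: population mean == mu0, plus a 95% CI."""
    n = len(values)
    m = mean(values)
    se = stdev(values) / math.sqrt(n)  # standard error of the sample mean
    z = (m - mu0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(0.975) * se
    return m, p_value, (m - half_width, m + half_width)


rng = random.Random(42)

# Hypothetical attributed graph, nodes only: node id -> attribute dict.
nodes = {
    i: {"type": "movie", "rating": rng.gauss(3.5, 1.0)}
    for i in range(10_000)
}

# Hypothesis-agnostic sampler (assumed): uniform sampling of 500 node ids.
sample_ids = rng.sample(sorted(nodes), 500)
ratings = [nodes[i]["rating"] for i in sample_ids if nodes[i]["type"] == "movie"]

# Node hypothesis (assumed): "the mean rating of movie nodes equals 3.5".
estimate, p_value, ci = one_sample_z_test(ratings, mu0=3.5)
```

A hypothesis-aware sampler would replace the uniform sampling step so that more of the sampling budget lands on nodes, edges, or paths relevant to the hypothesis; the testing step stays the same.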

Paper Structure (20 sections, 1 theorem, 12 equations, 8 figures, 4 tables, 2 algorithms)

Introduction
Definitions
Attributed Graphs
Hypotheses on Attributed Graphs
Problem Statement and Challenges
Sampling-based Hypothesis Testing
Hypothesis-Agnostic Samplers
Hypothesis-Aware Samplers
PHASE Algorithm
$\text{PHASE}_{\text{opt}}$ Algorithm
Convergence of Hypothesis Estimators
Experiments
Experimental Setup
Evaluation Measures
PHASE vs $\text{PHASE}_{\text{opt}}$
...and 5 more sections

Key Result

Theorem 1

For any function $f$ with $\sum_{(u,v)\in \mathcal{E}}|f(u,v)|<\infty$, the sample average of $f$ over the sampled edges converges to its expectation almost surely, i.e., the convergence occurs with probability one.

Figures (8)

Figure 1: DBLP network schema and paths
Figure 2: The Sampling-based Hypothesis Testing Framework on attributed graphs.
Figure 3: Transition probability matrices ${Q}$ for (a) node, (b) edge, and (c) path $(l=2)$ hypotheses (up to the first four rows). $x_i$ represents nodes in $\mathcal{G}$ satisfying the $i$-th node modifier on $\mathcal{P}$ and $y$ represents other nodes.
Figure 4: The convergence of hypothesis estimator for two path hypotheses: DB-P3 (left) and YP-P3 (right).
Figure 5: DBLP p-value and CI plot for the hypothesis DB-P1
...and 3 more figures

Theorems & Definitions (4)

Definition 1: Attributed Graph
Definition 2: Path
Definition 3: Path Hypothesis
Theorem 1: SLLN
