Higher-order accurate two-sample network inference and network hashing

Meijia Shao, Dong Xia, Yuan Zhang, Qiong Wu, Shuo Chen

TL;DR

This paper develops a comprehensive toolbox for two-sample hypothesis testing for network comparison, featuring a novel main method and its variants, all accompanied by strong theoretical guarantees; the main method is proved power-optimal.

Abstract

Two-sample hypothesis testing for network comparison presents many significant challenges, including: leveraging repeated network observations and known node registration when available, without requiring either; relaxing strong structural assumptions; achieving finite-sample higher-order accuracy; handling different network sizes and sparsity levels; computing quickly with a small memory footprint; controlling the false discovery rate (FDR) in multiple testing; and establishing theory, particularly on finite-sample accuracy and minimax optimality. In this paper, we develop a comprehensive toolbox, featuring a novel main method and its variants, all accompanied by strong theoretical guarantees, to address these challenges. Our method outperforms existing tools in speed and accuracy, and it is proved power-optimal. Our algorithms are user-friendly and versatile in handling various data structures (single or repeated network observations; known or unknown node registration). We also develop an innovative framework for offline hashing and fast querying, intended for large network databases. We showcase the effectiveness of our method through comprehensive simulations and applications to two real-world datasets, which revealed intriguing new structures.

Paper Structure (33 sections, 8 theorems, 46 equations, 12 figures, 9 tables, 4 algorithms)


Key Result

Theorem 1

Define the population Edgeworth expansion $G_{m,n}(u)$ for $\widehat{T}_{m,n}+\delta_T$ as in (eq:Gmn). Let $\widehat{G}_{m,n}$ be its empirical version defined above. Then we have

Figures (12)

  • Figure 1: Comparison of type I error control (Row 1) and power difference (Row 2: $\varpi=0.05$ and Row 3: $\varpi=0.20$). Blue in Row 1 and green in Rows 2 and 3 indicate performance advantage of our method; red and brown indicate disadvantageous comparisons.
  • Figure 2: Database offline hashing and querying. Row 1: comparison of methods on query accuracy and time cost. Row 2: we kept the $X$-axis range consistent, but this cuts out some cyan bars on the far left. For plots with complete $X$-axes, see the Supplementary Material (section labeled "supple::different graphon").
  • Figure 3: Control of FDR (dashed curves) and test power (solid curves) under different $\mathfrak{q}$ ($H_0$ proportion) and gaps between hypotheses ($\varpi$, marked as "shift" in the plots). Row 1: model 1 (keyword, $m$ nodes) vs. model 2 ($n$ nodes); row 2: model 1 vs model 3; row 3: model 3 vs model 4. Columns 1--4 are increasing network sizes $m=n\in\{40,80,160,320\}$.
  • Figure 4: Scenario 1: common node set. Row 1: $m=n=20$; row 2: $m=n=40$.
  • Figure 5: Scenario 2: independent node sets. Row 1: $m=n=20$; row 2: $m=n=40$.
  • ...and 7 more figures

Theorems & Definitions (9)

  • Theorem 1: Population and empirical Edgeworth expansions
  • Theorem 2
  • Theorem 3
  • Remark 1
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7: Asymptotic normality with automatic adaptation to indeterminate degeneracy
  • Theorem 8