Sample Complexity of Correlation Detection in the Gaussian Wigner Model

Dong Huang, Pengkun Yang

TL;DR

The paper studies correlation detection between two unlabeled Gaussian Wigner graphs when two induced subgraphs of size $s$ are sampled from graphs with $n$ vertices. It establishes the optimal sample-size scaling $s^2 \asymp \left( \frac{n\log n}{\log\left(1/(1-\rho^2)\right)} \vee n \right)$ for reliable detection and provides both possibility and impossibility results, with a polynomial-time approximate detector based on clique seeds achieving practical performance. The analysis introduces an $f$-based similarity statistic and leverages the conditional second moment to handle partial observations, yielding two detectors: a maximal-overlap estimator and a minimal-mean-squared-error estimator, each with regime-specific thresholds. The work has practical implications for efficient correlation testing and privacy considerations in network data, and it outlines extensions to other graph models and computational-hardness perspectives through the low-degree framework.

Abstract

Correlation analysis is a fundamental step in uncovering meaningful insights from complex datasets. In this paper, we study the problem of detecting correlations between two random graphs following the Gaussian Wigner model with unlabeled vertices. Specifically, the task is formulated as a hypothesis testing problem: under the null hypothesis, the two graphs are independent, while under the alternative hypothesis, they are edge-correlated through a latent vertex permutation, yet maintain the same marginal distributions as under the null. We focus on the scenario where two induced subgraphs, each with a fixed number of vertices, are sampled. We determine the optimal rate for the sample size required for correlation detection, derived through an analysis of the conditional second moment. Additionally, we propose an efficient approximate algorithm that significantly reduces running time.
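The testing setup described above can be simulated in a few lines. The following is a minimal sketch, not the authors' code: it assumes standard normal edge weights with zero diagonal, builds the correlated copy as $\rho A + \sqrt{1-\rho^2}\,Z$ before relabeling vertices by a uniformly random permutation, and draws each induced subgraph on $s$ vertices chosen uniformly without replacement. The names `symmetric_gaussian` and `sample_pair` are hypothetical.

```python
import numpy as np


def symmetric_gaussian(n, rng):
    """Symmetric n x n matrix with i.i.d. N(0, 1) off-diagonal entries, zero diagonal."""
    upper = np.triu(rng.standard_normal((n, n)), k=1)
    return upper + upper.T


def sample_pair(n, s, rho, correlated, rng):
    """Draw two s-vertex induced subgraphs under the null or the alternative.

    Assumption: unit-variance Gaussian weights, so both hypotheses share the
    same marginal distribution for each graph.
    """
    a = symmetric_gaussian(n, rng)
    z = symmetric_gaussian(n, rng)
    if correlated:
        # Each edge of b0 has correlation rho with the matching edge of a.
        b0 = rho * a + np.sqrt(1.0 - rho**2) * z
        # Hide the correspondence: b[perm[i], perm[j]] = b0[i, j].
        perm = rng.permutation(n)
        b = np.empty_like(b0)
        b[np.ix_(perm, perm)] = b0
    else:
        # Null hypothesis: the second graph is independent of the first.
        b, perm = z, None
    # Sample s vertices from each graph and keep the induced subgraphs.
    idx_a = rng.choice(n, size=s, replace=False)
    idx_b = rng.choice(n, size=s, replace=False)
    return a[np.ix_(idx_a, idx_a)], b[np.ix_(idx_b, idx_b)], perm


rng = np.random.default_rng(0)
sub_a, sub_b, perm = sample_pair(n=200, s=40, rho=0.9, correlated=True, rng=rng)
print(sub_a.shape, sub_b.shape)
```

Because the two vertex samples are drawn independently, only the vertices that land in both samples (after applying the latent permutation) carry any correlation, which is why the sample size $s$ relative to $n$ governs detectability.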

Paper Structure

This paper contains 32 sections, 11 theorems, 67 equations, 3 figures, and 1 algorithm.

Key Result

Theorem 1

There exist constants $\overline{C},\underline{C}$ such that, for any $0<\rho<1$, if $s^2\ge \overline{C}\left( \frac{n\log n}{\log\left( 1/(1-\rho^2) \right)}\vee n \right)$, [...]. Moreover, if $s^2 = \omega(n)$, then $\mathsf{TV}\left( \mathcal{P},\mathcal{Q} \right) = 1-o(1)$. Conversely, if $s^2\le \underline{C}\left( \frac{n\log n}{\log\left( 1/(1-\rho^2) \right)}\vee n \right)$, [...]. Moreover, if $s^2$ ...
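To get a feel for the scaling in the theorem, the quantity $\frac{n\log n}{\log(1/(1-\rho^2))}\vee n$ can be evaluated numerically. The sketch below drops the constants $\overline{C},\underline{C}$, whose values are not given here, so it indicates only the order of magnitude of $s^2$ and not an actual threshold. The function name `sample_size_scale` is hypothetical.

```python
import math


def sample_size_scale(n, rho):
    """Return max(n log n / log(1/(1 - rho^2)), n), the s^2 scale with constants dropped."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    # log(1 / (1 - rho^2)), computed stably for small rho via log1p.
    log_term = -math.log1p(-rho**2)
    return max(n * math.log(n) / log_term, n)


n = 10_000
for rho in (0.1, 0.5, 0.9):
    s_scale = math.sqrt(sample_size_scale(n, rho))
    print(f"rho={rho}: s is of order {s_scale:.0f} (n={n})")
```

The first term dominates when the correlation is weak. The maximum with $n$ takes over only once $\log\left(1/(1-\rho^2)\right) \ge \log n$, that is, when $\rho$ is very close to 1.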

Figures (3)

  • Figure 1: Histogram of the approximate test statistic $\sum_{e\in \binom{S_0}{2}} \beta_e\left( \mathcal{H}_{\pi_0}^f \right)$ in Algorithm 1 over 100 pairs of graphs; blue represents the correlated Gaussian Wigner model and green represents the independent graphs.
  • Figure 2: Comparison of ROC curves of the approximate test statistic for different sample sizes $s$.
  • Figure 3: Comparison of ROC curves of the approximate test statistic for different correlation coefficients $\rho$.

Theorems & Definitions (21)

  • Definition 1: Correlated Gaussian Wigner model
  • Definition 2: Strong and weak detection
  • Theorem 1
  • Lemma 1
  • Proposition 1
  • Proposition 2
  • Remark 1
  • Remark 2
  • Proposition 3
  • Definition 3: Correlated functional digraph
  • ...and 11 more