Estimating Graph Dimension with Cross-validated Eigenvalues

Fan Chen, Sebastien Roch, Karl Rohe, Shuqi Yu

TL;DR

This work introduces cross-validated eigenvalues to estimate the latent dimension $k$ in random graph models without strict parametric assumptions. The method relies on edge splitting to construct independent split graphs, preserving population eigenvectors, and a central limit theorem to produce p-values for each eigenvector, enabling consistent estimation of $k$ when all signal dimensions are detectable. It provides a flexible, scalable alternative to existing techniques, with theoretical guarantees under Poisson and Bernoulli graph models and strong empirical performance on simulated and real networks. The approach yields interpretable results and competitive accuracy with substantially reduced computational cost, broadening practical applicability in network science and high-dimensional spectral inference.

Abstract

In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters, $k$, is a fundamental and recurring problem. We study a sequence of statistics called "cross-validated eigenvalues." Under a large class of random graph models, including both Poisson and Bernoulli edges, without parametric assumptions, we provide a $p$-value for each cross-validated eigenvalue. Each $p$-value tests the null hypothesis that the corresponding sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent dimensions. This approach naturally adapts to problems where some dimensions are not statistically detectable. In scenarios where all $k$ dimensions can be estimated, we show that our procedure consistently estimates $k$. In simulations and a data example, the proposed estimator compares favorably to alternative approaches in both computational and statistical performance.
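The edge-splitting idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the split probability `eps`, the Poisson plug-in variance estimate, the two-sided test, and the rule of stopping at the first non-significant eigenvector are all assumptions made for illustration, and the simulated blockmodel is toy data.

```python
import math
import numpy as np


def simulate_poisson_sbm(n, k, p_in, p_out, rng):
    """Symmetric Poisson blockmodel with k equal-sized blocks (toy data only)."""
    labels = np.arange(n) % k
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.poisson(P), k=1)  # one draw per node pair, no self-loops
    return upper + upper.T


def split_edges(A, eps, rng):
    """Send each edge to the test graph with probability eps (binomial thinning)."""
    upper = np.triu(A, k=1)
    test_upper = rng.binomial(upper, eps)
    train_upper = upper - test_upper
    return train_upper + train_upper.T, test_upper + test_upper.T


def cross_validated_eigenvalues(A, k_max, eps=0.1, alpha=0.05, seed=0):
    """Hypothetical helper: z-scores and p-values for the leading eigenvectors."""
    rng = np.random.default_rng(seed)
    A_train, A_test = split_edges(A, eps, rng)

    # Eigenvectors come from the training graph only.
    vals, vecs = np.linalg.eigh(A_train.astype(float))
    top = np.argsort(-np.abs(vals))[:k_max]
    X = vecs[:, top]

    # Quadratic form x^T A_test x for each training eigenvector x.
    lam_test = np.sum(X * (A_test @ X), axis=0)

    # Assumed plug-in variance for symmetric Poisson counts:
    # Var(x^T A_test x) is estimated by 2 * sum_ij x_i^2 x_j^2 A_test[i, j].
    X2 = X ** 2
    var_hat = 2.0 * np.sum(X2 * (A_test @ X2), axis=0)
    z = lam_test / np.sqrt(np.maximum(var_hat, 1e-12))

    # Two-sided normal p-values (an assumption of this sketch).
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

    # Assumed stopping rule: count leading eigenvectors until the first
    # one that is not significant at level alpha.
    k_hat = 0
    for p in p_values:
        if p >= alpha:
            break
        k_hat += 1
    return lam_test / eps, z, p_values, k_hat


rng = np.random.default_rng(1)
A = simulate_poisson_sbm(n=300, k=3, p_in=0.3, p_out=0.03, rng=rng)
lam_cv, z, p_values, k_hat = cross_validated_eigenvalues(A, k_max=6)
print("z-scores:", np.round(z, 2))
print("estimated dimension:", k_hat)
```

Because the eigenvectors are computed on the training graph and evaluated on the held-out test graph, a direction that only reflects noise in the training edges should yield a quadratic form near zero on the test edges, which is what the z-score measures.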

Paper Structure

This paper contains 43 sections, 17 theorems, 101 equations, 12 figures, 2 tables, 3 algorithms.

Key Result

Proposition 2.1

For $j = 1, \dots, q$, $\hat{\lambda}_j = \lambda_{P}(\hat{x}_j)$ is the solution to

Figures (12)

  • Figure 1: In these examples, it is difficult to detect a gap or elbow. In the left panel, the graph is simulated from a Degree-Corrected Stochastic Blockmodel with $n=2560$. In the right panel, the graph is a citation graph among $n = 22,688$ academic journals. Displayed are the largest 150 eigenvalues of the normalized and regularized adjacency matrix.
  • Figure 2: In the left panel, the black line gives the sample eigenvalues (repeated from the left panel of Figure 1) and the orange line gives the $k=128$ non-zero population eigenvalues. The first two eigenvalues have been removed to improve the display. The blue line gives the cross-validated eigenvalues, and the red line gives their population version, the cross-population eigenvalues. In the right panel, the black line depicts the Z-scores for the cross-validated eigenvalues. The horizontal black line corresponds to the cutoff at the 0.05 significance level (two-sided). In this example, a good choice for $\hat{k}$ would be around 60.
  • Figure 3: In the left panel, the black line gives the empirical eigenvalues (repeated from the right panel of Figure 1). The blue line gives the cross-validated eigenvalues. In the right panel, the Z-scores are calculated under the null hypothesis that each cross-population eigenvalue is zero. The horizontal black line corresponds to the cutoff at the 0.05 significance level.
  • Figure 4: Comparison of accuracy for different graph dimensionality estimates under the DCSBM. The panel strips on the top indicate the node degree distribution used. Within each panel, each colored line depicts the relative error of one estimation method as the average node degree increases. Each point on a line is an average over 100 repeated experiments.
  • Figure 5: Comparison of runtime for the different graph dimensionality methods. Each colored bar indicates the runtime of applying each method on a DCSBM graph with 2000 nodes and 10 blocks. The maximum graph dimensionality is set to 15 for all methods. The runtime was averaged across 100 repeated experiments.
  • ...and 7 more figures

Theorems & Definitions (44)

  • Definition 1: Poisson random graph
  • Remark 2.1: Directed edges
  • Remark 2.2: Bernoulli edges
  • Remark 2.3: Random dot product model
  • Proposition 2.1: lam_nonparametric_2016
  • Lemma 3.1
  • Corollary 3.1
  • Proposition 3.1
  • proof
  • Theorem 3.1: CLT for cross-validated eigenvalues
  • ...and 34 more