Table of Contents
Fetching ...

Scalable network reconstruction in subquadratic time

Tiago P. Peixoto

TL;DR

The paper tackles the scalability challenge of reconstructing networks from observational data, which traditionally incurs a quadratic $O(N^2)$ cost. It introduces a Greedy Coordinate Descent (GCD) framework that updates only $\kappa N$ edge weights per iteration by identifying promising edge candidates via a FindBest subroutine built on NNDescent for approximate $k$-nearest-neighbor search, thereby achieving subquadratic, data-dependent runtime. Theoretical analysis yields upper bounds such as $O\left(\kappa^{3/2} N^{3/2} \log N\right)$ and typical log-linear behavior $O\left(N\log^{2}N\right)$ under plausible assumptions, with tighter results for specific degree distributions. Empirically, the method delivers orders-of-magnitude speedups over CD on large synthetic and real datasets (Ising and Gaussian models), including microbiome and gene-expression networks with hundreds of thousands to millions of nodes, and benefits from straightforward parallelization.

Abstract

Network reconstruction consists in determining the unobserved pairwise couplings between $N$ nodes given only observational data on the resulting behavior that is conditioned on those couplings -- typically a time-series or independent samples from a graphical model. A major obstacle to the scalability of algorithms proposed for this problem is a seemingly unavoidable quadratic complexity of $Ω(N^2)$, corresponding to the requirement of each possible pairwise coupling being contemplated at least once, despite the fact that most networks of interest are sparse, with a number of non-zero couplings that is only $O(N)$. Here we present a general algorithm applicable to a broad range of reconstruction problems that significantly outperforms this quadratic baseline. Our algorithm relies on a stochastic second neighbor search (Dong et al., 2011) that produces the best edge candidates with high probability, thus bypassing an exhaustive quadratic search. If we rely on the conjecture that the second-neighbor search finishes in log-linear time (Baron & Darling, 2020; 2022), we demonstrate theoretically that our algorithm finishes in subquadratic time, with a data-dependent complexity loosely upper bounded by $O(N^{3/2}\log N)$, but with a more typical log-linear complexity of $O(N\log^2N)$. In practice, we show that our algorithm achieves a performance that is many orders of magnitude faster than the quadratic baseline -- in a manner consistent with our theoretical analysis -- allows for easy parallelization, and thus enables the reconstruction of networks with hundreds of thousands and even millions of nodes and edges.

Scalable network reconstruction in subquadratic time

TL;DR

The paper tackles the scalability challenge of reconstructing networks from observational data, which traditionally incurs a quadratic cost. It introduces a Greedy Coordinate Descent (GCD) framework that updates only edge weights per iteration by identifying promising edge candidates via a FindBest subroutine built on NNDescent for approximate -nearest-neighbor search, thereby achieving subquadratic, data-dependent runtime. Theoretical analysis yields upper bounds such as and typical log-linear behavior under plausible assumptions, with tighter results for specific degree distributions. Empirically, the method delivers orders-of-magnitude speedups over CD on large synthetic and real datasets (Ising and Gaussian models), including microbiome and gene-expression networks with hundreds of thousands to millions of nodes, and benefits from straightforward parallelization.

Abstract

Network reconstruction consists in determining the unobserved pairwise couplings between nodes given only observational data on the resulting behavior that is conditioned on those couplings -- typically a time-series or independent samples from a graphical model. A major obstacle to the scalability of algorithms proposed for this problem is a seemingly unavoidable quadratic complexity of , corresponding to the requirement of each possible pairwise coupling being contemplated at least once, despite the fact that most networks of interest are sparse, with a number of non-zero couplings that is only . Here we present a general algorithm applicable to a broad range of reconstruction problems that significantly outperforms this quadratic baseline. Our algorithm relies on a stochastic second neighbor search (Dong et al., 2011) that produces the best edge candidates with high probability, thus bypassing an exhaustive quadratic search. If we rely on the conjecture that the second-neighbor search finishes in log-linear time (Baron & Darling, 2020; 2022), we demonstrate theoretically that our algorithm finishes in subquadratic time, with a data-dependent complexity loosely upper bounded by , but with a more typical log-linear complexity of . In practice, we show that our algorithm achieves a performance that is many orders of magnitude faster than the quadratic baseline -- in a manner consistent with our theoretical analysis -- allows for easy parallelization, and thus enables the reconstruction of networks with hundreds of thousands and even millions of nodes and edges.
Paper Structure (16 sections, 7 theorems, 29 equations, 7 figures, 4 algorithms)

This paper contains 16 sections, 7 theorems, 29 equations, 7 figures, 4 algorithms.

Key Result

Lemma 1

Assuming the GCD algorithm alg:gcd converges after $\tau$ iterations, with $\tau$ independent of $N$---implying that there are at most $O(N)$ nonzero entries in $\bm W$---then its overall algorithmic complexity is determined solely by the $\text{FindBest}(\kappa N, \mathcal{V}, \text{d})$ function.

Figures (7)

  • Figure 1: Diagrammatic representation of the sets $\mathcal{S}$ and $\mathcal{S}'$ in algorithm \ref{['alg:mclosest']} for $k=3$. Edges marked in red belong to set $\mathcal{D}^+$, i.e. the $2m$ directed pairs $(i,j)$ with smallest $\text{d}(i,j)$. Note that reciprocal edges need not both belong to $\mathcal{D}^+$, despite $\text{d}(i,j)$ being symmetric, since ties are broken arbitrarily. The nodes in red have all out-edges (nearest neighbors) in set $\mathcal{D}$, and hence are assigned to set $\mathcal{S}'$. Since the set of $m$ best pairs could still contain undiscovered pairs of elements in $\mathcal{S}'$, the search needs to continue recursively for members of this set.
  • Figure 2: Example of our greedy coordinate descent algorithm for a covariance selection problem on a simulated dataset composed of $M=500$ samples given a network of friendships among high-school students moody_race_2001. The panels show intermediary results of the algorithm, starting from an empty network [i.e. $W_{ij} = 0$ for all $(i,j)$]. The top row shows (a) the random initialization of NNDescent (algorithm \ref{['alg:knn']}) with $k=4$, (b) its final result, and (c) the exact result found with an exhaustive algorithm. The middle row shows (d) the result of the $m=\kappa N$ best updates using algorithm \ref{['alg:mclosest']} with $\kappa=1$ and (e) the exact result according to an exhaustive algorithm. The edge colors indicate the value of $\max_{W_{ij}}\pi(\bm{W})$. The bottom row shows the first four iterations of the GCD algorithm.
  • Figure 3: Runtime of the FindBest function (algorithm \ref{['alg:mclosest']}), for different values of $\kappa$, on $M=10$ samples of a multivariate Gaussian (see Appendix \ref{['app:models']}) on $N$ nodes and nonzero entries of $\bm W$ sampled as an Erdős-Rényi graph with mean degree $5$ and nonzero weights independently normally sampled with mean $-10^{3}$ and standard deviation 10, and diagonal entries $W_{ii}=\sum_{j\ne i}|W_{ij}|/(1-\epsilon)^{2}$ with $\epsilon=10^{-3}$. The results show averages over $10$ independent problem instances.
  • Figure 4: Cumulative recall rates for the FindBest function for different values of $\kappa$ on a variety of empirical data and reconstruction objectives, as shown in the legend (see Appendix \ref{['app:models']}). The results shows averages over 10 runs of the algorithm.
  • Figure 5: Convergence of the CD and GCD algorithms for artificial data sampled from a multivariate Gaussian (same parameterization as Fig. \ref{['fig:scaling']} but with $M=100$ samples) for three different values of the number of nodes $N$ and values of $\kappa$, together with the CD baseline. The bottom panel shows the results obtained for empirical data for the human microbiome samples of the elbow joint, using the Ising model.
  • ...and 2 more figures

Theorems & Definitions (14)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 3: Upper bound on FindBest
  • proof
  • Lemma 4
  • proof
  • Theorem 5
  • proof
  • ...and 4 more