
Finding Super-spreaders in Network Cascades

Elchanan Mossel, Anirudh Sridhar

TL;DR

This work investigates learning structural features of a network from cascade traces generated by a continuous-time SI process when the underlying graph is unknown. It introduces a second-derivative-based estimator that identifies high-degree vertices by analyzing the discretized second derivative of the infection curve at infection times. For degree threshold $D=n^{\alpha}$ with $\alpha>3/4$, this enables exact recovery of the set of high-degree vertices from a constant number of cascades. It also proves an information-theoretic lower bound, via a connection to sparse mixture detection, showing that for $\alpha\in(0,1/2)$ at least $\log n / \log\log n$ cascades are needed; this implies that when the graph is a tree, estimating super-spreaders can be nearly as hard as learning the whole graph. The results reveal a phase transition in sample complexity across regimes of $\alpha$, provide a concrete algorithm (Algorithm 1) with provable guarantees, and leave open the intermediate regime $\alpha \in [1/2,3/4]$ as well as extensions to noisy observations, general graphs, and other motifs.

Abstract

Suppose that a cascade (e.g., an epidemic) spreads on an unknown graph, and only the infection times of vertices are observed. What can be learned about the graph from the infection times caused by multiple distinct cascades? Most of the literature on this topic focuses on the task of recovering the entire graph, which requires $\Omega(\log n)$ cascades for an $n$-vertex bounded-degree graph. Here we ask a different question: can the important parts of the graph be estimated from just a few (i.e., a constant number of) cascades, even as $n$ grows large? In this work, we focus on identifying super-spreaders (i.e., high-degree vertices) from infection times caused by a Susceptible-Infected process on a graph. Our first main result shows that vertices of degree greater than $n^{3/4}$ can indeed be estimated from a constant number of cascades. Our algorithm for doing so leverages a novel connection between vertex degrees and the second derivative of the cumulative infection curve. Conversely, we show that estimating vertices of degree smaller than $n^{1/2}$ requires at least $\log(n) / \log \log (n)$ cascades. Surprisingly, this matches (up to $\log \log n$ factors) the number of cascades needed to learn the entire graph if it is a tree.
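
To make the observation model concrete, the following is a minimal sketch of how infection times from one Susceptible-Infected cascade could be simulated. It assumes that every edge transmits independently after an Exp(1) delay (the unit rate is an assumption, not taken from the text above), in which case infection times coincide with shortest-path distances under i.i.d. exponential edge weights. The function name `simulate_si_cascade` and the adjacency-dict graph format are hypothetical choices made for illustration.

```python
import heapq
import random


def simulate_si_cascade(adj, source, rng):
    """Return {vertex: infection time} for one SI cascade started at `source`.

    Assumption: each edge transmits after an independent Exp(1) delay, so the
    infection time of a vertex is its shortest-path distance from the source
    under i.i.d. Exp(1) edge weights (computed here with lazy Dijkstra).
    """
    times = {}
    heap = [(0.0, source)]
    while heap:
        t, v = heapq.heappop(heap)
        if v in times:
            continue  # v was already infected through a faster edge
        times[v] = t
        for u in adj[v]:
            if u not in times:
                # Draw the transmission delay along edge (v, u).
                heapq.heappush(heap, (t + rng.expovariate(1.0), u))
    return times


# Star graph: hub 0 joined to leaves 1..4; the cascade starts at leaf 1.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
times = simulate_si_cascade(adj, source=1, rng=random.Random(0))
```

Only the values of `times` would be visible to the estimator; the graph `adj` is the unknown object. In the star example, every leaf other than the source is infected after the hub, which is the kind of burst the second-derivative approach looks for.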

Paper Structure (24 sections, 24 theorems, 141 equations, 3 figures)

Key Result

Theorem 1.3

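Suppose that $\alpha \in (3/4, 1)$ and that $D = n^\alpha$. If $K$ satisfies [condition missing from the extracted text; the Figure 1 caption indicates more than $1/(\alpha - 3/4)$ cascades], then there is an estimator $\widehat{\mathrm{HD}}$ such that for any $G \in \mathcal{G}(n,m,d,D)$ and any collection of source vertices $v_0 = (v_{0,1}, \ldots, v_{0,K})$, [displayed guarantee missing from the extracted text; per the TL;DR, the estimator exactly recovers the set of high-degree vertices], where $o(1) \to 0$ as $n \to \infty$.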
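
As a rough illustration of how the number of cascades scales with $\alpha$, the sketch below reads the Figure 1 caption literally ("more than $1/(\alpha - 3/4)$ traces") and returns the smallest integer strictly above that bound. The helper name `min_cascades` is hypothetical, and any constants in the theorem's exact condition on $K$ are not reflected here. Exact rational arithmetic is used so that boundary cases such as $\alpha = 4/5$ are not affected by floating-point rounding.

```python
import math
from fractions import Fraction


def min_cascades(alpha):
    """Smallest integer K with K > 1 / (alpha - 3/4), for alpha in (3/4, 1).

    Assumption: the bound is taken verbatim from the Figure 1 caption.
    `alpha` may be a Fraction or a decimal string such as "0.9".
    """
    alpha = Fraction(alpha)
    if not Fraction(3, 4) < alpha < 1:
        raise ValueError("alpha must lie strictly between 3/4 and 1")
    bound = 1 / (alpha - Fraction(3, 4))
    return math.floor(bound) + 1  # strictly greater than the bound


k_near_one = min_cascades("0.9")           # bound is 20/3
k_boundary = min_cascades(Fraction(4, 5))  # bound is exactly 20
```

The bound grows without limit as $\alpha$ approaches $3/4$ from above, which is consistent with the caption's description of the blue region.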

Figures (3)

  • Figure 1: Phase diagram for the possibility and impossibility of estimating high-degree vertices in a graph $G$. Blue region: High-degree vertices can be estimated from more than $1/(\alpha - 3/4)$ traces. Red region: Estimating high-degree vertices is impossible, even when $G$ is known to be a tree. Green region: Full recovery of $G$ is possible if $G$ is a tree, hence estimation of high-degree vertices is also possible. White region: Regimes with small gap between the bounds provided by our analysis. Yellow region: The main open problem left given our work is to determine the sample complexity in this region.
  • Figure 2: Visualization of the edges contributing to $\mathsf{cut}(\mathcal{I}(t))$ before $(a)$ and after $(b)$ a vertex $v$ becomes infected. In panel $(b)$, the numbers of blue and red edges denote the positive and negative changes to $\mathsf{cut}(\mathcal{I}(t))$, respectively, upon $v$ being infected.
  • Figure 3: Plots of the discrete first $(a)$ and second derivative $(b)$ of the infection curve with $\delta = 0.075$ generated from a graph $G$ with approximately 500,000 vertices and one high-degree vertex. The infection time of the high-degree vertex can be identified by the red dotted line in $(a)$ and the large peak in the second derivative plot $(b)$, highlighted by the red rectangle. $G$ was chosen to be a balanced, 5-regular tree of height 8, where one vertex in the 6th layer of the tree has degree 7500.
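
The quantities plotted in Figure 3 can be sketched as follows. This is a minimal illustration, assuming the infection curve $I(t)$ counts vertices infected by time $t$, a forward difference for the first derivative, and a central difference for the second derivative; the paper's exact discretization may differ, and the function names are hypothetical. The intuition matches Figure 2: when a vertex with many susceptible neighbors is infected, the cut grows sharply, so the infection rate jumps and the second difference spikes.

```python
import bisect


def infection_curve(sorted_times, t):
    """I(t): number of vertices whose infection time is at most t."""
    return bisect.bisect_right(sorted_times, t)


def discrete_derivatives(times, t, delta):
    """Discretized first and second derivative of I at time t.

    Assumed definitions (not taken from the paper):
      first  = (I(t + delta) - I(t)) / delta
      second = (I(t + delta) - 2 I(t) + I(t - delta)) / delta**2
    """
    s = sorted(times)
    before = infection_curve(s, t - delta)
    now = infection_curve(s, t)
    after = infection_curve(s, t + delta)
    first = (after - now) / delta
    second = (after - 2 * now + before) / delta ** 2
    return first, second


# Toy trace: a burst of infections right after the vertex infected at t = 2.0.
toy_times = [0.0, 2.0, 2.1, 2.2, 2.3, 2.4, 5.0]
first, second = discrete_derivatives(toy_times, t=2.0, delta=0.5)
```

Evaluating `discrete_derivatives` at each observed infection time and looking for unusually large second differences mirrors the peak highlighted in panel $(b)$ of Figure 3, though the thresholds used by the paper's Algorithm 1 are not reproduced here.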

Theorems & Definitions (54)

  • Definition 1.1
  • Theorem 1.3
  • Theorem 1.4
  • Theorem 1.5
  • Theorem 2.1
  • Remark 2.2
  • Remark 2.3
  • Proof sketch of Theorem \ref{thm:alg_3/4}
  • Proposition 3.1
  • Lemma 3.2
  • ...and 44 more