Non-robustness of diffusion estimates on networks with measurement error

Arun G. Chandrasekhar, Paul Goldsmith-Pinkham, Tyler H. McCormick, Samuel Thau, Jerry Wei

TL;DR

This paper shows that diffusion forecasts on networks are highly fragile to vanishing measurement error in the network or seed location. By formalizing a polynomial-diffusion regime with a sparse, unobserved error graph E_n, it proves that small seed perturbations and missing links can drastically alter diffusion paths, while average parameter estimation (e.g., p_n and R0) remains possible. Monte Carlo simulations and three empirical applications (COVID mobility, rural India marketing diffusion, and China insurance uptake) illustrate substantial underestimation of diffusion when relying on observed networks. The results highlight fundamental limits on forecasting diffusion in noisy networks and suggest caution in policy design, advocating broader early intervention and careful consideration of data quality in network diffusion analyses.

Abstract

Network diffusion models are used to study phenomena such as disease transmission, information spread, and technology adoption. However, small amounts of mismeasurement are extremely likely in the networks constructed to operationalize these models. We show that estimates of diffusions are highly non-robust to this measurement error. First, we show that even when measurement error is vanishingly small, such that the share of missed links is close to zero, forecasts about the extent of diffusion will greatly underestimate the truth. Second, a small mismeasurement in the identity of the initial seed generates a large shift in the location of the expected diffusion path. We show that both of these results still hold when the vanishing measurement error is only local in nature. Such non-robustness in forecasting exists even under conditions where the basic reproductive number is consistently estimable. Possible solutions, such as estimating the measurement error or implementing widespread detection efforts, still face difficulties because the number of missed links is so small. Finally, we conduct Monte Carlo simulations on simulated networks and on real networks from three settings: travel data from the COVID-19 pandemic in the western US, a mobile phone marketing campaign in rural India, and an insurance experiment in China.
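The first claim in the abstract, that forecasts on the observed network understate diffusion even when very few links are missed, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's specification: a square lattice stands in for the observed network $L_n$, a handful of uniformly random extra links stand in for the unobserved error graph $E_n$, and a simple discrete-time susceptible-infected process with per-period transmission probability `p` stands in for the diffusion model. The function names are hypothetical.

```python
import random

def grid_graph(side):
    """Adjacency sets for a side x side lattice (illustrative stand-in for L_n)."""
    adj = {(r, c): set() for r in range(side) for c in range(side)}
    for r in range(side):
        for c in range(side):
            for dr, dc in ((1, 0), (0, 1)):
                rr, cc = r + dr, c + dc
                if rr < side and cc < side:
                    adj[(r, c)].add((rr, cc))
                    adj[(rr, cc)].add((r, c))
    return adj

def add_missed_links(adj, n_links, rng):
    """Copy adj and add n_links random new links (illustrative stand-in for L_n plus E_n)."""
    nodes = sorted(adj)
    full = {v: set(nbrs) for v, nbrs in adj.items()}
    added = 0
    while added < n_links:
        u, v = rng.sample(nodes, 2)
        if v not in full[u]:
            full[u].add(v)
            full[v].add(u)
            added += 1
    return full

def simulate_si(adj, seed, p, periods, rng):
    """Assumed SI process: each active node activates each inactive neighbour w.p. p per period."""
    active = {seed}
    for _ in range(periods):
        newly = {v for u in active for v in adj[u]
                 if v not in active and rng.random() < p}
        active |= newly
    return active

def mean_reach(adj, seed, p, periods, n_sims, rng):
    """Monte Carlo average of the number of ever-activated nodes."""
    total = sum(len(simulate_si(adj, seed, p, periods, rng)) for _ in range(n_sims))
    return total / n_sims

if __name__ == "__main__":
    rng = random.Random(0)
    observed = grid_graph(41)                      # 3,280 lattice links
    true_net = add_missed_links(observed, 20, rng)  # under 1% extra links
    seed = (20, 20)
    for periods in (5, 10, 20):
        r_obs = mean_reach(observed, seed, 0.5, periods, 200, rng)
        r_true = mean_reach(true_net, seed, 0.5, periods, 200, rng)
        print(periods, "periods: observed/true reach ratio =", round(r_obs / r_true, 3))
```

The printed ratio compares expected reach on the observed lattice with expected reach on the lattice plus missed links; how quickly it falls depends on the assumed lattice size, link count, and `p`, none of which are taken from the paper.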

Paper Structure (30 sections, 8 theorems, 36 equations, 15 figures, 8 tables)

This paper contains 30 sections, 8 theorems, 36 equations, 15 figures, 8 tables.

Key Result

Theorem 1

Let Assumptions ass:disease, ass:forecast-time, and ass:beta hold. Let $i_0$ be an arbitrary initial seed and consider the stochastic sequence $\{G_n\}_n$ comprised of a fixed sequence of $\{L_n\}_n$ and random $\{E_n\}_n$. Let $U_{n,i_0} = B_{i_0}(a_n)$ be a ball on $G_n$ of radius $a_n$ around $i_

Figures (15)

  • Figure 1: A heuristic construction of $J_{i_0}$ using $\mathbb{R}^2$ to represent $L_n$. Let $e_1$ and $e_2$ be the closest and second closest nodes in $L_n$ that also have a link in $E_n$. The smaller red dotted circle denotes $U_{n,i_0} := B_{i_0}(b_n)$, while the larger denotes $B_{e_2}(a_n)$. The intersection gives the set $J_{i_0}$.
  • Figure 2: Panels \ref{fig:mc-sens-dep-4} and \ref{fig:mc-sens-dep-2} show simulations of Theorem \ref{thm:sensitive-dep}, while panels \ref{fig:mc-ratio-4} and \ref{fig:mc-ratio-2} show simulations of Theorem \ref{thm:main-polynomial}. Panels \ref{fig:mc-sens-dep-4} and \ref{fig:mc-sens-dep-2} each fix a separate draw of $E_n$, then each choose a fixed $j_0$. We then simulate 2,500 diffusion processes while tracking the Jaccard index after perturbing the initial seed location. In Panels \ref{fig:mc-ratio-4} and \ref{fig:mc-ratio-2}, we simulate 2,500 iterations of the diffusion process on both $L_n$ and $G_n$ for each value of $q$, re-drawing $E_n$ for each simulation. We then track the expected number of ever-activated nodes under each simulation at each time period, and then take the ratio.
  • Figure 3: Simulated version of Theorems \ref{thm:sensitive-dep} and \ref{thm:main-polynomial} on $L_n$ and $G_n$ generated from Census tract flow data in California and Nevada. Panels (A) and (C) show simulations of Theorem \ref{thm:sensitive-dep}, while Panels (B) and (D) show simulations of Theorem \ref{thm:main-polynomial}.
  • Figure 4: Simulations of Theorems \ref{thm:sensitive-dep} and \ref{thm:main-polynomial} on village networks from Karnataka, India. Panel (A) shows a version of Theorem \ref{thm:sensitive-dep}. We perturb one seed uniformly at random by a single set in each village. Then, we simulate 2,500 diffusion processes on a fixed draw of $G_n$, computing the average Jaccard index of the process. Panel (B) shows a version of Theorem \ref{thm:main-polynomial}. We take 2,500 diffusion simulations on $L_n$ and $G_n$, where $G_n$ is constructed at the village level with $\beta_n = \frac{1}{2n_v}$. $n_v$ is the number of households in the village.
  • Figure 5: The joint distribution of the difference in $\hat{\gamma}(L_n)$ and $\hat{\gamma}(G_n)$ (in percentage terms) and the level at which we can reject the null that $\hat{\gamma}(L_n) = 0$ for different values of $k$. As $k$ increases, $\beta_{v,n}$ decreases. In parentheses, we include the average value of the corresponding $\beta_n$ across villages. The red, dashed, vertical line denotes the level at which we can reject $\hat{\gamma}(G_n) = 0$. The black dotted line shows rejection at the 95 percent level.
  • ...and 10 more figures
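
The Jaccard-index tracking mentioned in the Figure 2 and Figure 4 captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: a ring lattice with optional shortcut links stands in for $G_n$, the diffusion is an assumed discrete-time susceptible-infected process with transmission probability `p`, and all function names are hypothetical. The Jaccard index itself is the standard one, the size of the intersection of two sets divided by the size of their union.

```python
import random

def ring_with_shortcuts(n, shortcuts):
    """Ring lattice on n nodes plus a list of extra (u, v) links (illustrative G_n)."""
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    for u, v in shortcuts:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def ever_activated(adj, seed, p, periods, rng):
    """Assumed SI process: returns the set of nodes activated within the given periods."""
    active = {seed}
    for _ in range(periods):
        newly = {v for u in active for v in adj[u]
                 if v not in active and rng.random() < p}
        active |= newly
    return active

def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; defined as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_jaccard(adj, i0, j0, p, periods, n_sims, rng):
    """Average overlap between diffusions started at seed i0 and at perturbed seed j0."""
    total = 0.0
    for _ in range(n_sims):
        from_i0 = ever_activated(adj, i0, p, periods, rng)
        from_j0 = ever_activated(adj, j0, p, periods, rng)
        total += jaccard(from_i0, from_j0)
    return total / n_sims

if __name__ == "__main__":
    rng = random.Random(0)
    net = ring_with_shortcuts(200, [(1, 100)])  # one assumed long-range link
    print(mean_jaccard(net, 0, 1, 0.5, 10, 500, rng))
```

A value near 1 means the two seeds produce nearly the same set of activated nodes; lower values indicate that the one-step seed perturbation has moved the diffusion to different parts of the network.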

Theorems & Definitions (17)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Remark 1
  • proof
  • Proposition 1
  • Theorem 3
  • Theorem 4
  • Proposition 2
  • proof : Proof of Lemma \ref{lem:regions}
  • ...and 7 more