Flexible Modeling of Information Diffusion on Networks with Statistical Guarantees

Alexander Kagan, Elizaveta Levina, Ji Zhu

TL;DR

This work introduces the General Linear Threshold (GLT) diffusion framework, a flexible, threshold-based model that unifies Linear Threshold and Independent Cascade dynamics while allowing node-specific threshold distributions. It provides a principled likelihood-based approach to estimate edge weights, proves identifiability conditions, and establishes finite-sample error bounds and asymptotic normality under mild assumptions. Extensions to partially observed traces and parametric threshold estimation enable practical inference in real-world diffusion settings, including uncertainty quantification and robust influence maximization. Empirical results on synthetic and real networks, including a movie-rating dataset, demonstrate improved edge-weight estimation, activation-probability prediction, and IM performance when using GLT with learned threshold distributions.

Abstract

Modeling information spread through a network is one of the key problems of network analysis, with applications in a wide array of areas such as marketing and public health. Most approaches assume that the spread is governed by some probabilistic diffusion model, often parameterized by the strength of connections between network members (edge weights), highlighting the need for methods that can accurately estimate them. Multiple prior works suggest such estimators for particular diffusion models; however, most of them lack a rigorous statistical analysis that would establish the asymptotic properties of the estimator and allow for uncertainty quantification. In this paper, we develop a likelihood-based approach to estimate edge weights from the observed information diffusion paths under the proposed General Linear Threshold (GLT) model, a broad class of discrete-time information diffusion models that includes both the well-known linear threshold (LT) and independent cascade (IC) models. We first derive necessary and sufficient conditions that make the edge weights identifiable under this model. Then, we derive a finite sample error bound for the estimator and demonstrate that it is asymptotically normal under mild conditions. We conclude by studying the GLT model in the context of the Influence Maximization (IM) problem, that is, the task of selecting a subset of $k$ nodes to start the diffusion, so that the average information spread is maximized. Extensive experiments on synthetic and real-world networks demonstrate that the flexibility of the proposed class of GLT models, coupled with the proposed estimation and inference framework for its parameters, can significantly improve estimation of spread from a given subset of nodes, prediction of node activation, and the quality of the IM problem solutions.

Paper Structure

This paper contains 24 sections, 18 theorems, 108 equations, 8 figures, 1 algorithm.

Key Result

Proposition 3.1

The class of IC models is equivalent to the class of GLT models with all node thresholds distributed as $U_v\sim \operatorname{Exponential}(1)$, that is, with $F_v(x) = 1-e^{-x}$.
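A small numerical check can make this equivalence more concrete. The sketch below assumes the mapping $b_{uv} = -\log(1 - p_{uv})$ between IC edge probabilities $p_{uv}$ and GLT edge weights $b_{uv}$; this mapping and all function names are illustrative assumptions, not taken from the text above. Under that assumption, the probability that at least one independent IC attempt succeeds coincides with the Exponential(1) CDF evaluated at the summed weights of the active parents.

```python
import math

def ic_activation_prob(edge_probs):
    """Probability that at least one independent IC attempt succeeds."""
    fail = 1.0
    for p in edge_probs:
        fail *= 1.0 - p  # every attempt must fail for the node to stay inactive
    return 1.0 - fail

def exp1_cdf(x):
    """CDF of the Exponential(1) distribution, F(x) = 1 - exp(-x)."""
    return 1.0 - math.exp(-x)

def glt_activation_prob(edge_probs):
    """Threshold-style activation probability with Exponential(1) thresholds.

    Assumed (hypothetical) weight mapping: b = -log(1 - p) for each edge.
    """
    weights = [-math.log(1.0 - p) for p in edge_probs]
    return exp1_cdf(sum(weights))

# Hypothetical edge probabilities from three active parents of one node.
probs = [0.2, 0.5, 0.1]
ic = ic_activation_prob(probs)
glt = glt_activation_prob(probs)
print(ic, glt)
```

This only verifies the single-node identity $1 - \prod_u (1 - p_{uv}) = 1 - e^{-\sum_u b_{uv}}$; it is not a substitute for the full argument over the whole diffusion process, which also has to account for parents becoming active at different time steps.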

Figures (8)

  • Figure 1: An example network with three communities with different levels of receptiveness to new information, all modeled with the Beta distribution.
  • Figure 2: Relationship between different diffusion models.
  • Figure 3: A star graph of in-degree $m$: $V=\{1, \ldots, m+1\}$, $E=\{(1, m+1), \ldots, (m, m+1)\}$.
  • Figure 4: Left: Relative MAE of the LT estimator as a function of the number of nodes $n$ for different densities $k / n$, with $N=2000$ traces and $d_{\max} = 1$. Right: Relative MAE as a function of the number of traces $N$ for different maximum node in-degrees $d_{\max}$, with $n = 100$ and density $k/n = 0.1$. The error bars represent two standard errors and are calculated from 10 repetitions of each experiment.
  • Figure 5: Estimated node activation probabilities together with the corresponding Delta method confidence intervals computed under different GLT models with $\operatorname{Beta}(2, 1)$-GLT as the ground truth.
  • ...and 3 more figures

Theorems & Definitions (49)

  • Definition 1: Feasible trace
  • Definition 2: Diffusion model
  • Remark 1
  • Example 1: Linear Threshold (LT) Model
  • Example 2: Independent Cascade (IC) Model
  • Example 3: Triggering Model
  • Definition 3: Diffusion model subclass
  • Definition 4: General Linear Threshold (GLT) model
  • Proposition 3.1
  • Proof
  • ...and 39 more