Flexible Modeling of Information Diffusion on Networks with Statistical Guarantees
Alexander Kagan, Elizaveta Levina, Ji Zhu
TL;DR
This work introduces the General Linear Threshold (GLT) diffusion framework, a flexible threshold-based model that unifies Linear Threshold (LT) and Independent Cascade (IC) dynamics while allowing node-specific threshold distributions. It provides a principled likelihood-based approach to estimating edge weights, derives conditions under which the weights are identifiable, and establishes finite-sample error bounds and asymptotic normality under mild assumptions. Extensions to partially observed traces and parametric threshold estimation enable practical inference in real-world diffusion settings, including uncertainty quantification and robust influence maximization (IM). Empirical results on synthetic and real networks, including a movie-rating dataset, show that GLT with learned threshold distributions improves edge-weight estimation, activation-probability prediction, and IM performance.
Abstract
Modeling information spread through a network is one of the key problems of network analysis, with applications in a wide array of areas such as marketing and public health. Most approaches assume that the spread is governed by some probabilistic diffusion model, often parameterized by the strength of connections between network members (edge weights), which highlights the need for methods that can estimate these weights accurately. Multiple prior works propose such estimators for particular diffusion models; however, most of them lack a rigorous statistical analysis that would establish the asymptotic properties of the estimator and allow for uncertainty quantification. In this paper, we develop a likelihood-based approach to estimate edge weights from the observed information diffusion paths under the proposed General Linear Threshold (GLT) model, a broad class of discrete-time information diffusion models that includes both the well-known linear threshold (LT) and independent cascade (IC) models. We first derive necessary and sufficient conditions under which the edge weights are identifiable in this model. Then, we derive a finite-sample error bound for the estimator and demonstrate that it is asymptotically normal under mild conditions. We conclude by studying the GLT model in the context of the Influence Maximization (IM) problem, that is, the task of selecting a subset of $k$ nodes to start the diffusion so that the average information spread is maximized. Extensive experiments on synthetic and real-world networks demonstrate that the flexibility of the proposed class of GLT models, coupled with the proposed estimation and inference framework for its parameters, can significantly improve estimation of spread from a given subset of nodes, prediction of node activation, and the quality of the IM problem solutions.
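
For readers unfamiliar with the terminology, the following is a minimal sketch of the classical LT model and of a Monte Carlo greedy heuristic for the IM problem described above. It is not the GLT model or the estimator developed in the paper: thresholds are drawn uniformly as in the standard LT model, edge weights are assumed known rather than estimated, and the function names (`simulate_lt`, `estimate_spread`, `greedy_im`) and the nested-dict weight layout are hypothetical choices made for illustration.

```python
import random


def simulate_lt(weights, seeds, rng):
    """Run one classical Linear Threshold cascade; return the final active set.

    weights[v] maps each in-neighbor u of node v to an edge weight
    (hypothetical layout chosen for this sketch). The incoming weights
    of each node are assumed to sum to at most 1.
    """
    nodes = set(weights) | {u for nbrs in weights.values() for u in nbrs} | set(seeds)
    # Classical LT: one threshold per node, drawn uniformly from (0, 1].
    thresholds = {v: 1.0 - rng.random() for v in sorted(nodes)}
    active = set(seeds)
    while True:
        # Synchronous discrete-time step: each inactive node compares the
        # total weight of its active in-neighbors with its threshold.
        newly = {
            v for v in nodes - active
            if sum(w for u, w in weights.get(v, {}).items() if u in active) >= thresholds[v]
        }
        if not newly:
            return active
        active |= newly


def estimate_spread(weights, seeds, n_sims, rng):
    """Monte Carlo estimate of the expected number of active nodes."""
    total = sum(len(simulate_lt(weights, seeds, rng)) for _ in range(n_sims))
    return total / n_sims


def greedy_im(weights, k, n_sims, rng):
    """Greedily add the node with the largest estimated spread, k times."""
    nodes = set(weights) | {u for nbrs in weights.values() for u in nbrs}
    chosen = []
    for _ in range(min(k, len(nodes))):
        best, best_spread = None, -1.0
        for candidate in sorted(nodes - set(chosen)):
            spread = estimate_spread(weights, chosen + [candidate], n_sims, rng)
            if spread > best_spread:
                best, best_spread = candidate, spread
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Toy directed network with made-up weights, for illustration only.
    toy = {1: {0: 0.6}, 2: {0: 0.3, 1: 0.5}, 3: {2: 0.9}}
    print(greedy_im(toy, k=2, n_sims=500, rng=random.Random(0)))
```

Because the spread is estimated by simulation, the selected seed set depends on the weights supplied to it, which is why the accuracy of the estimated edge weights matters for the quality of IM solutions.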
