Identifying Key Nodes for the Influence Spread using a Machine Learning Approach
Mateusz Stolarski, Adam Piróg, Piotr Bródka
TL;DR
The paper tackles identifying influential seed nodes to maximize spread under the Independent Cascade diffusion model. It introduces an enhanced ML framework with Smart Bins discretization for robust ground-truth labeling, a 14-feature centrality embedding augmented by diffusion probabilities, and evaluation across multiple real networks using LightGBM. Key findings show that Smart Bins improve label quality and stability, and cross-network generalization is feasible, especially within network families; feature analysis highlights out-degree and related measures as primary predictors. The work offers practical pathways for rapid, generalizable seed selection in applications like marketing and epidemic control, with future directions including graph neural networks and network-similarity guided training set design.
Abstract
The identification of key nodes in complex networks is an important topic in many areas of network science. It is vital to a variety of real-world applications, including viral marketing, epidemic spreading and influence maximization. In recent years, machine learning algorithms have been shown to outperform conventional, centrality-based methods in accuracy and consistency, but this approach still requires further refinement. What information about the influencers can be extracted from the network? How can we precisely obtain the labels required for training? Can these models generalize well? In this paper, we answer these questions by presenting an enhanced machine learning-based framework for the influence spread problem. We focus on identifying key nodes for the Independent Cascade model, which is a popular reference method. Our main contribution is an improved process of obtaining the labels required for training by introducing 'Smart Bins' and demonstrating their advantage over known methods. Next, we show that our methodology allows ML models not only to predict the influence of a given node, but also to determine other characteristics of the spreading process, which is another novelty in the relevant literature. Finally, we extensively test our framework and its ability to generalize across complex networks of different types and sizes, gaining important insight into the properties of these methods.
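For readers unfamiliar with the Independent Cascade model named above, the following minimal sketch shows how the influence of a single seed node can be estimated by Monte Carlo simulation. It is an illustration under stated assumptions rather than the implementation used in this work: the function names `simulate_ic` and `expected_spread`, the adjacency-dictionary graph format, the uniform activation probability `p`, and the toy graph are all hypothetical choices made for this example.

```python
import random


def simulate_ic(graph, seeds, p, rng):
    """Run one Independent Cascade realization and return the activated nodes.

    graph: dict mapping each node to a list of its out-neighbors (assumed format).
    seeds: iterable of initially active nodes.
    p:     uniform activation probability per edge (an assumption of this sketch).
    rng:   a random.Random instance, passed in for reproducibility.
    """
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # A newly activated node gets exactly one chance to
                # activate each of its still-inactive out-neighbors.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def expected_spread(graph, seed, p, runs=1000, rng_seed=0):
    """Estimate the mean number of activated nodes for a single seed node."""
    rng = random.Random(rng_seed)
    total = sum(len(simulate_ic(graph, [seed], p, rng)) for _ in range(runs))
    return total / runs


# Hypothetical toy directed graph, used only for illustration.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}

if __name__ == "__main__":
    for node in toy_graph:
        print(node, expected_spread(toy_graph, node, p=0.3))
```

Averaged spread values of this kind are the sort of continuous influence score that a labeling step would then discretize into classes for training; the specific discretization proposed in the paper (Smart Bins) is not reproduced here.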
