Modelling the Spread of New Information on Social Networks
Ziming Xu, Shi Zhou, Vasileios Lampos, Ingemar J. Cox
TL;DR
This paper addresses the prediction of reposting on social networks when the topic is previously unseen, which turns the task into an out-of-distribution (OOD) generalisation problem. It shows that models relying on message content alone perform well in-distribution but much worse on new hashtags, whereas supplementing or replacing message features with user profile and past-behaviour features raises the OOD F1 score from 0.117 to 0.705. The authors introduce a dataset of 14 trending hashtags with rich user and message features and compare decision-tree, neural-network, and BERT-based approaches, finding that user-centric models such as DT-U and NN-U surpass text-based baselines in OOD settings. The work underscores the importance of evaluating both in-distribution and OOD performance for diffusion-related tasks, and suggests that who users are and how they have acted previously may be more predictive of reposting on unseen topics than the content itself. Practically, these results inform design choices for moderation, information diffusion modelling, and platform interventions by prioritising user-history signals over raw message content when predicting the spread of unseen topics.
Abstract
There has been considerable interest in modelling the spread of information on social networks using machine learning models. Here, we consider the problem of predicting the spread of new information, i.e. when a user propagates information about a topic previously unseen by the user. In existing work, information and users are randomly assigned to a test or training set, ensuring that both sets are drawn from the same distribution. In the spread of new information, the problem becomes an out-of-distribution generalisation classification task. Our experimental results reveal that while existing algorithms, which predominantly use features derived from the content of messages, perform well when the training and test distributions are the same, these algorithms perform much worse when the test set is out-of-distribution, i.e. when the topic (hashtag) of the testing data is absent from the training data. We then show that if the message features are supplemented or replaced with features derived from users' profile and past behaviour, the out-of-distribution prediction is greatly improved, with the F1 score increasing from 0.117 to 0.705. Our experimental results suggest that a significant component of reposting behaviour for previously unseen topics can be predicted from users' profile and past behaviour, and is largely content-agnostic.
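To make the distinction between the two evaluation settings concrete, the following is a minimal sketch, not the authors' code. It contrasts a random (in-distribution) split with a split in which one hashtag is held out entirely (out-of-distribution), and includes the F1 metric used to report results. The sample layout (a dict with a `hashtag` key), the leave-one-hashtag-out protocol, and all function names are assumptions made for illustration; the exact splitting procedure is not specified in the abstract.

```python
import random
from collections import defaultdict


def random_split(samples, test_fraction=0.2, seed=0):
    """In-distribution split: samples are shuffled and assigned at random,
    so every hashtag can appear in both the training and the test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leave_one_hashtag_out(samples):
    """Out-of-distribution splits (assumed protocol): for each hashtag,
    yield (held_out_hashtag, train, test), where test holds every sample
    of that hashtag and train holds the samples of all other hashtags."""
    by_tag = defaultdict(list)
    for sample in samples:
        by_tag[sample["hashtag"]].append(sample)
    for held_out in sorted(by_tag):
        train = [s for tag, group in by_tag.items()
                 if tag != held_out for s in group]
        yield held_out, train, by_tag[held_out]


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1 = user reposted)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 3 hashtags with 4 samples each.
toy = [{"hashtag": f"#topic{i}", "user_id": j, "reposted": j % 2}
       for i in range(3) for j in range(4)]

for held_out, train, test in leave_one_hashtag_out(toy):
    # The held-out hashtag never appears in the training data.
    assert all(s["hashtag"] != held_out for s in train)
```

With the random split, a model can learn hashtag-specific content cues that also appear at test time. With the held-out split it cannot, which is the setting in which the abstract reports content-based features performing much worse.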
