Beyond Leakage and Complexity: Towards Realistic and Efficient Information Cascade Prediction
Jie Peng, Rui Wang, Qiang Wang, Zhewei Wei, Bin Tong, Guan Wang
TL;DR
The paper tackles practical challenges in information cascade prediction by (1) enforcing a time-ordered data split to eliminate information leakage, (2) introducing Taoke, a feature-rich dataset with conversion signals for second-stage prediction, and (3) proposing CasTemp, a lightweight model that uses temporal walks, a Jaccard-based inter-cascade graph, and GRU-Attention with Temporal Decay to predict both future popularity and downstream conversions. CasTemp achieves state-of-the-art performance across multiple datasets with substantial training speedups, and it performs particularly well at second-stage conversion forecasting on Taoke. These contributions narrow the gap between academic cascade modeling and real-world business applications, supporting more reliable forecasting and actionable monetization insights. The work also highlights that, once leakage is removed, simpler architectures are sufficient for genuinely modeling diffusion dynamics.
Abstract
Information cascade popularity prediction is a key problem in analyzing content diffusion in social networks. However, existing work suffers from three critical limitations: (1) temporal leakage in evaluation, where random cascade-based splits allow models to access future information and yield unrealistic results; (2) feature-poor datasets that lack downstream conversion signals (e.g., likes, comments, or purchases), which limits practical applications; and (3) computational inefficiency of complex graph-based methods that require days of training for marginal gains. We systematically address these challenges from three perspectives: task setup, dataset construction, and model design. First, we propose a time-ordered splitting strategy that chronologically partitions data into consecutive windows, ensuring that models are evaluated on genuine forecasting tasks without future information leakage. Second, we introduce Taoke, a large-scale e-commerce cascade dataset featuring rich promoter and product attributes and ground-truth purchase conversions, capturing the complete diffusion lifecycle from promotion to monetization. Third, we develop CasTemp, a lightweight framework that efficiently models cascade dynamics through temporal walks, Jaccard-based neighbor selection for inter-cascade dependencies, and GRU-based encoding with time-aware attention. Under leak-free evaluation, CasTemp achieves state-of-the-art performance across four datasets with orders-of-magnitude speedup. Notably, it excels at predicting second-stage popularity conversions, a practical task critical for real-world applications.
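To make the time-ordered split and the Jaccard-based neighbor selection concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each cascade is identified by an ID with a start timestamp and a set of participating user IDs, that the split is defined by two hypothetical cut-off times (`t_val`, `t_test`), and that Jaccard similarity is computed over participant sets. The abstract does not specify any of these details, and the function names are illustrative only.

```python
from typing import Dict, List, Set, Tuple

Cascade = Tuple[str, float]  # (cascade_id, start_time); an assumed representation


def time_ordered_split(
    cascades: List[Cascade], t_val: float, t_test: float
) -> Tuple[List[Cascade], List[Cascade], List[Cascade]]:
    """Partition cascades into consecutive time windows by start time.

    Train: start < t_val; validation: t_val <= start < t_test; test: start >= t_test.
    No cascade in an earlier window starts after one in a later window.
    """
    if t_val > t_test:
        raise ValueError("t_val must not exceed t_test")
    ordered = sorted(cascades, key=lambda c: c[1])
    train = [c for c in ordered if c[1] < t_val]
    val = [c for c in ordered if t_val <= c[1] < t_test]
    test = [c for c in ordered if c[1] >= t_test]
    return train, val, test


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def top_k_neighbors(
    target_users: Set[int], candidates: Dict[str, Set[int]], k: int
) -> List[Tuple[str, float]]:
    """Return up to k candidate cascades with the highest non-zero Jaccard score.

    Ties are broken by cascade ID so the result is deterministic.
    """
    scored = [(cid, jaccard(target_users, users)) for cid, users in candidates.items()]
    scored = [item for item in scored if item[1] > 0.0]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


if __name__ == "__main__":
    cascades = [("c1", 5.0), ("c2", 12.0), ("c3", 20.0), ("c4", 8.0)]
    train, val, test = time_ordered_split(cascades, t_val=10.0, t_test=15.0)
    print([cid for cid, _ in train], [cid for cid, _ in val], [cid for cid, _ in test])

    earlier = {"a": {2, 3, 4}, "b": {9}, "c": {1, 2, 3, 7}}
    print(top_k_neighbors({1, 2, 3}, earlier, k=2))
```

In a leak-free setting, the candidate pool passed to `top_k_neighbors` would presumably be restricted to cascades observed before the target's prediction time, so that neighbor selection itself does not reintroduce future information.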
