Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

Kimia Kazemian, Zhenzhen Liu, Yangfanyu Yang, Katie Z Luo, Shuhan Gu, Audrey Du, Xinyu Yang, Jack Jansons, Kilian Q Weinberger, John Thickstun, Yian Yin, Sarah Dean

TL;DR

Lead-Lag Forecasting (LLF) is the task of predicting long-horizon lag outcomes from early signals in a different channel on social platforms. The authors formalize LLF, introduce two large-scale benchmarks (arXiv: accesses → citations; GitHub: pushes/stars → forks), and provide evaluation protocols for cross-channel and cross-series generalization with horizons up to $H=1825$ days (five years). Baseline experiments show that early activity carries usable predictive signal: cross-channel models outperform within-channel baselines, and Time-MoE embeddings offer gains in some settings. The results establish LLF as a distinct forecasting paradigm and a foundation for scalable, cross-entity time-series research. The work highlights practical implications for early-hit identification in scholarly communication and software ecosystems, while noting limitations such as aggregate-level data and evolving platform dynamics. A publicly available data portal supports community-driven development of LLF methods and benchmarks.

Abstract

Social and collaborative platforms emit multivariate time-series traces in which early interactions, such as views, likes, or downloads, are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, we present two high-volume benchmark datasets, arXiv (accesses → citations of 2.3M papers) and GitHub (pushes/stars → forks of 3M repositories), and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views → edits), Spotify (streams → concert attendance), e-commerce (click-throughs → purchases), and LinkedIn (profile views → messages). Our datasets provide ideal testbeds for lead-lag forecasting by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We document all technical details of data curation and cleaning, verify the presence of lead-lag dynamics through statistical and classification tests, and benchmark parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. Our data portal with downloads and documentation is available at https://lead-lag-forecasting.github.io/.
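
As a rough illustration of the lead-lag setup described in the abstract, the sketch below regresses a long-horizon lag total on early lead activity and scores it on held-out series. Everything in it is an assumption made for illustration: the synthetic generator, the function names (`make_synthetic_series`, `fit_lead_lag_baseline`), the 30-day observation window, and the log-linear least-squares model are not taken from the paper, and the data are simulated rather than drawn from the arXiv or GitHub benchmarks.

```python
import numpy as np


def make_synthetic_series(n_items=500, horizon=1825, seed=0):
    """Simulate daily counts for a lead and a lag channel (hypothetical data).

    Both channels share a latent per-item popularity. The lead channel peaks
    early and decays; the lag channel ramps up slowly over the horizon.
    """
    rng = np.random.default_rng(seed)
    popularity = rng.lognormal(mean=0.0, sigma=1.0, size=n_items)
    days = np.arange(horizon)
    lead_rate = popularity[:, None] * np.exp(-days / 60.0)
    lag_rate = 0.05 * popularity[:, None] * (1.0 - np.exp(-days / 365.0))
    return rng.poisson(lead_rate), rng.poisson(lag_rate)


def fit_lead_lag_baseline(lead, lag, t_obs=30, train_frac=0.8):
    """Fit log1p(lag total at the horizon) ~ log1p(lead total in first t_obs days).

    The model is fit on the first train_frac of the series and evaluated on
    the remaining, unseen series. Returns (coefficients, held-out R^2).
    """
    x = np.log1p(lead[:, :t_obs].sum(axis=1))   # early lead signal
    y = np.log1p(lag.sum(axis=1))               # long-horizon lag outcome
    design = np.column_stack([x, np.ones_like(x)])
    n_train = int(train_frac * len(x))
    coef, *_ = np.linalg.lstsq(design[:n_train], y[:n_train], rcond=None)
    y_test = y[n_train:]
    resid = y_test - design[n_train:] @ coef
    r2 = 1.0 - (resid ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
    return coef, r2


lead, lag = make_synthetic_series()
(slope, intercept), r2 = fit_lead_lag_baseline(lead, lag)
print(f"slope={slope:.3f} intercept={intercept:.3f} held-out R^2={r2:.3f}")
```

Splitting by series rather than by time loosely mirrors the cross-series generalization setting mentioned in the TL;DR. A real experiment would use the benchmark's own splits and horizons in place of this simulated data.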

Paper Structure

This paper contains 22 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Left: Histogram of paper publication time in the train split. Right: Distribution across coarse categories in the train split.
  • Figure 2: Left: Citation delay in days after first access. Right: Distribution of accesses and citations over time. Log-scaled histograms show cumulative accesses (top) and citations (bottom) at 30 days, 100 days, 1 year, and 5 years after publication. Colors progress from light to dark over time.
  • Figure 3: Left: Pearson correlation of early accesses (blue) and citations (orange) with 5-year citation count. Access data shows strong early correlation. After 3 months (90 days), citations gradually become more predictive. Right: Hexbin plots of log-transformed early accesses vs. five-year citations. Each subplot corresponds to the 30-, 100-, and 365-day access horizon. Color indicates the log-density of papers. These plots illustrate a clear positive association across all horizons, with the signal becoming sharper and more linear at longer horizons.
  • Figure 4: Left: GitHub repository creation timeline. Right: Distribution across coarse packages.
  • Figure 5: Left: Lag between first activity and the first star and first fork events. Right: Distributions of GitHub engagement signals across time horizons. Log-scaled histograms show cumulative counts of pushes (top), stars (middle), and forks (bottom) measured at 30 days, 100 days, 1 year, and 5 years after repository creation. Colors progress from lighter to darker as time advances.
  • ...and 2 more figures
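
The kind of analysis summarized in the Figure 3 caption (Pearson correlation between log-transformed early counts and the five-year outcome, at several observation horizons) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `correlation_by_horizon`, the simulated `accesses` and `citations` arrays, and the rate parameters are hypothetical, and the resulting numbers say nothing about the actual benchmark data.

```python
import numpy as np


def correlation_by_horizon(early_cumulative, final_outcome, horizons):
    """Pearson correlation of log1p(cumulative count at day h) with log1p(final outcome)."""
    target = np.log1p(final_outcome)
    return {
        h: float(np.corrcoef(np.log1p(early_cumulative[:, h - 1]), target)[0, 1])
        for h in horizons
    }


# Simulated stand-in data (assumption): shared latent popularity drives both channels.
rng = np.random.default_rng(1)
n_items, n_days = 400, 1825
popularity = rng.lognormal(sigma=1.0, size=n_items)
days = np.arange(n_days)
accesses = rng.poisson(popularity[:, None] * np.exp(-days / 60.0))
citations = rng.poisson(0.05 * popularity[:, None] * (1.0 - np.exp(-days / 365.0)))

cum_accesses = accesses.cumsum(axis=1)
cum_citations = citations.cumsum(axis=1)
final_citations = cum_citations[:, -1]   # outcome at the 5-year horizon

horizons = (30, 100, 365)
access_corr = correlation_by_horizon(cum_accesses, final_citations, horizons)
citation_corr = correlation_by_horizon(cum_citations, final_citations, horizons)

for h in horizons:
    print(f"day {h:>3}: accesses r={access_corr[h]:.3f}, citations r={citation_corr[h]:.3f}")
```

Comparing the two dictionaries at each horizon gives the cross-channel versus within-channel contrast that the caption describes.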