Information diffusion assumptions can distort our understanding of social network dynamics

Matthew R. DeVerna, Francesco Pierri, Rachith Aiyappa, Diogo Pacheco, John Bryden, Filippo Menczer

TL;DR

This study investigates the implications of the common practice of ignoring cascade reconstruction altogether and proposes a novel reconstruction approach that allows us to evaluate the effects of different assumptions made during the cascade inference procedure.

Abstract

To analyze the flow of information online, experts often rely on platform-provided data from social media companies, which typically attribute all resharing actions to an original poster. This obscures the true dynamics of how information spreads online, as users can be exposed to content in various ways. While most researchers analyze data as it is provided by the platform and overlook this issue, some attempt to infer the structure of these information cascades. However, the absence of ground truth about actual diffusion cascades makes verifying the efficacy of these efforts impossible. This study investigates the implications of the common practice of ignoring reconstruction altogether. Two case studies involving data from Twitter and Bluesky reveal that reconstructing cascades significantly alters the identification of influential users, thereby affecting downstream analyses in general. We also propose a novel reconstruction approach that allows us to evaluate the effects of different assumptions made during the cascade inference procedure. Analysis of the diffusion of over 40,000 true and false news stories on Twitter reveals that the assumptions made during the reconstruction procedure drastically distort both microscopic and macroscopic properties of cascade networks. This work highlights the challenges of studying information spreading processes on complex networks and has significant implications for the broader study of digital platforms.

Paper Structure

This paper contains 18 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Cascade reconstruction with Probabilistic Diffusion Inference (PDI). (a): Hypothetical empirical data of a message cascade with an original post (blue cross) and a sequence of resharing actions (red circles) over time. Each post is associated with a timestamp (represented by the time sequence) and the number of followers of the resharing user (next to the user icon). (b): The naive cascade constructed from platform-provided data, which assumes that every user directly reshared the original post. (c): The true cascade, reflecting the actual parent-child relationships. Panels (d, e, f) demonstrate different cascade reconstructions when applying various PDI assumptions. The recency assumption (d) prioritizes users who reshared the content more recently, capturing temporal dynamics. The followers assumption (e) gives higher resharing likelihood to users with more followers, emphasizing popularity. Combining both assumptions (f) incorporates both temporal activity and popularity into the cascade reconstruction.
  • Figure 2: Effects of cascade reconstruction on a Twitter resharing network. (a) shows the naive network, while (b) displays a version of the same network reconstructed using PDI parameters $\gamma = 0.5$ and $\alpha = 2.0$. For illustration purposes, only nodes from the two largest communities are included. Node size reflects the number of retweets received by an account, with larger nodes representing more influential accounts. Node color represents the number of retweets an account has made, where red nodes indicate amplifiers that extensively retweet others' content.
  • Figure 3: Node influence is substantially affected by cascade reconstruction. Heat map cells display the mean Spearman's correlation $\rho$ between node strength values in naive and PDI-reconstructed networks, averaged over 100 versions of the reconstructed network at the specified parameter settings. A $\rho$ value of one means the reconstruction doesn't alter node influence, while values closer to zero suggest significant changes. The maximum standard deviation of correlation values for any parameter setting is 0.001 for Twitter and 0.003 for Bluesky (see the Supplementary Information for full statistics).
  • Figure 4: Resharing networks reconstructed using the PDI method show substantial shifts in node influence compared to those built from naive data, on both Bluesky and Twitter. Panels (a, b, c) present results for Bluesky, while panels (d, e, f) show results for Twitter. All panels reflect reconstructions using PDI parameters $\gamma = 0.25$ and $\alpha = 3.0$. (a, d): Comparison of node strength between a single version of the PDI-reconstructed network and the corresponding naive network. (b, e): Average change in node strength relative to naive strength, across all 100 PDI reconstructions. The red crosses show the median values. (c, f): Jaccard similarity between the top $k$% of influential nodes identified based on node strength from reconstructed and naive networks. Each point represents one of the 100 possible comparisons. Circle sizes in panels (a, b, d, e) represent the number of nodes at each point. For visualization purposes, we use the same size for all points with 500 or more nodes.
  • Figure 5: Cascades reconstructed in different ways are highly dissimilar, especially for larger cascades. Each panel shows the mean cascade similarity as a function of cascade size, with similarity measured using the Jaccard index. The panels correspond to different reconstruction parameter settings. Fit lines are generated using locally weighted robust smoothing of the $\sim$28k mean values, while points represent the means in 500 equally-sized x-axis bins. Error bars show 95% confidence intervals calculated from 1,000 bootstraps.
  • ...and 2 more figures
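
The top-$k$% comparison described in the Figure 4 caption (panels c, f) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `top_k_percent` and `jaccard`, the toy strength values, the ceiling-based rounding of the top-$k$% cutoff, and the tie-breaking rule are all hypothetical choices made for this example.

```python
import math


def top_k_percent(strength, k):
    """Return the set of node ids in the top k% by node strength.

    Assumptions (not from the paper): the cutoff is rounded up, at least
    one node is always returned, and ties are broken by node id.
    """
    n_top = max(1, math.ceil(len(strength) * k / 100))
    ranked = sorted(strength, key=lambda node: (-strength[node], node))
    return set(ranked[:n_top])


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy node strengths (retweets received) for the same ten accounts.
# In the naive network, all reshares are credited to original posters.
naive = {"u1": 50, "u2": 30, "u3": 5, "u4": 3, "u5": 1,
         "u6": 0, "u7": 0, "u8": 0, "u9": 0, "u10": 0}
# In a hypothetical reconstruction, some credit shifts to intermediate resharers.
reconstructed = {"u1": 20, "u2": 8, "u3": 25, "u4": 18, "u5": 10,
                 "u6": 4, "u7": 2, "u8": 1, "u9": 1, "u10": 0}

k = 20  # compare the top 20% of accounts
top_naive = top_k_percent(naive, k)
top_recon = top_k_percent(reconstructed, k)
similarity = jaccard(top_naive, top_recon)
print(sorted(top_naive), sorted(top_recon), round(similarity, 3))
```

In this toy example only one account appears in both top-20% sets, so the Jaccard similarity is 1/3. Repeating the comparison for each reconstructed version of the network would give one point per reconstruction, as in the caption.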