
Leveraging Content Producer Networks and User Perception to Detect Online Discursive Communities

Stefano Guarino, Ayoub Mounim, Guido Caldarelli, Fabio Saracco

TL;DR

This work proposes a community-detection framework for online social networks that exploits the asymmetry between the few users who produce content and the many who consume and redistribute it: it first identifies and clusters a set of leading users, then extends the resulting labels to the broader user base.

Abstract

Online discussions are often characterized by strong behavioral asymmetries: a relatively small fraction of users actively produces content, while the majority primarily consumes and redistributes it. Here we propose a community-detection framework for online social networks that exploits this asymmetry by first identifying and clustering a set of leading users, and then extending the resulting labels to the broader user base. We introduce two complementary strategies to cluster leaders, one based on their mutual interactions and the other on audience overlap, both relying on entropy-based filtering to separate signal from noise. We evaluate the framework on three major Italian political debates on Twitter/X, using public figures, identified through the pre-2022 verification system, as leaders, and known affiliations of political actors as ground-truth labels. Compared with standard baselines, the proposed approach yields more coherent and interpretable communities aligned with political structures, with the interaction-based and audience-based variants recovering parties and coalitions, respectively. Activity-based criteria for selecting leaders produce qualitatively similar but consistently weaker results, particularly at the coalition level. Overall, our findings show that statistically validated networks of publicly recognized figures, whose off-platform roles constrain and stabilize their online behavior, provide a strong basis for identifying discursive communities on social media. Although developed for Twitter/X, the approach is conceptually general, as it leverages structural asymmetries common to many online platforms.
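The second stage of the framework, extending leader labels to the broader user base, can be illustrated with a minimal sketch. The abstract does not specify which label propagation algorithm is used, so the synchronous majority vote below is an assumption made purely for illustration, and the function name `propagate_labels`, its parameters, and the toy accounts are hypothetical.

```python
from collections import Counter


def propagate_labels(retweets, leader_labels, max_rounds=10):
    """Extend leader community labels to other users by majority vote.

    retweets: dict mapping each user to the list of accounts they retweeted
    leader_labels: dict mapping each leader to a community label (kept fixed)
    NOTE: a simple synchronous majority vote, assumed for illustration only.
    """
    labels = dict(leader_labels)
    for _ in range(max_rounds):
        updates = {}
        for user, sources in retweets.items():
            if user in leader_labels:
                continue  # leaders keep their seed label
            # Count the labels of the already-labelled accounts this user retweeted
            votes = Counter(labels[s] for s in sources if s in labels)
            if votes:
                # Most frequent label wins; ties are broken by label order
                best = min(votes, key=lambda lab: (-votes[lab], str(lab)))
                if labels.get(user) != best:
                    updates[user] = best
        if not updates:
            break  # no label changed in this round
        labels.update(updates)
    return labels


# Toy example (hypothetical accounts and labels)
retweets = {
    "u1": ["leaderA", "leaderA", "leaderB"],
    "u2": ["u1"],        # only reachable through another standard user
    "u3": ["leaderB"],
    "u4": [],            # never retweets, so stays unlabelled
}
leader_labels = {"leaderA": "party_1", "leaderB": "party_2"}
result = propagate_labels(retweets, leader_labels)
```

In this toy case `u2` receives a label only in the second round, once `u1` has been labelled, which is the sense in which labels spread outward from the leaders.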

Paper Structure

This paper contains 15 sections, 20 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: $\mathrm{MonoDC}$'s pipeline. Starting from the Twitter/X data, we build a bipartite directed network between accounts and posts; an arrow from a user (a circle) to a post (a square) represents authorship, while an arrow in the opposite direction represents a retweet. The bipartite network is then projected onto a monopartite directed network of users, which is statistically validated against the expectations of a maximum entropy null model for bipartite directed networks. Next, the sub-network of the selected content creators (magenta circles) is analysed, and communities are detected using standard algorithms. Finally, the resulting labels are propagated to standard users (blue circles) through a label propagation algorithm. The icon representing Twitter/X data is obtained by superimposing the icons "Data Base #7994662" by Farrih Icon and "Hashtag #4315467" by Alex Burte, both from the Noun Project.
  • Figure 2: $\mathrm{BiDC}$'s pipeline. Starting from the Twitter/X data, we build a bipartite undirected network between the accounts of content creators and those of standard users; a link is present if a standard user retweeted the given content creator at least once. The bipartite network is then projected onto a monopartite network of content creators, which is statistically validated against the expectations of a maximum entropy null model for bipartite undirected networks. Finally, the community labels obtained on this validated network are propagated to standard users through a label propagation algorithm. The procedure was first proposed in Becatti2019d. The icon representing Twitter/X data is obtained by superimposing the icons "Data Base #7994662" by Farrih Icon and "Hashtag #4315467" by Alex Burte, both from the Noun Project.
  • Figure 3: Limits of standard methods. The $\mathrm{VM}_{1}$ between the party annotations and the partitions obtained with off-the-shelf algorithms, restricted to the annotated users. Each violin represents the distribution over 100 independent runs. Standard methods show limited accuracy in recognizing the affiliation of well-known politicians.
  • Figure 4: Comparison of the performance of $\mathrm{MonoDC}_{V}$ and $\mathrm{BiDC}_{V}$ with standard methods. The $\mathrm{VM}_{1}$ between partitions of the annotated users, obtained with different methods, and the annotations at the party (first row) and coalition (second row) level. Each violin represents the distribution over 100 independent runs, and we use the Louvain algorithm as a benchmark because it performs no worse than the other algorithms (cf. Fig. \ref{fig:stability_accuracy_standard}). $\mathrm{MonoDC}_{V}$ and $\mathrm{BiDC}_{V}$ systematically outperform Louvain, with $\mathrm{MonoDC}_{V}$ generally preferable for identifying parties and $\mathrm{BiDC}_{V}$ for identifying coalitions.
  • Figure 5: Comparison of the performance of $\mathrm{MonoDC}_{V}$ and $\mathrm{BiDC}_{V}$ with standard methods at different resolution levels. The $\mathrm{VM}_{\beta}$ between partitions of the annotated users, obtained with different methods, and the annotations, for different values of $\beta$, at both the party (top) and coalition (bottom) level. Each point is the average over 100 runs of the method indicated by the colour and marker. Analogous results are obtained when considering the propagated annotations. When compared with manually annotated parties (top) and coalitions (bottom), $\mathrm{MonoDC}_{V}$ and $\mathrm{BiDC}_{V}$ exhibit a consistently greater average $\mathrm{VM}_{\beta}$ than standard methods even for $\beta\neq 1$, i.e. at essentially all scales at which the political spectrum can be observed. In this sense, standard algorithms, when executed on the entire retweet network, do not return sub- or super-groups of the political parties that $\mathrm{MonoDC}_{V}$ and $\mathrm{BiDC}_{V}$ cannot detect.
  • ...and 9 more figures
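
The captions above score partitions with $\mathrm{VM}_{1}$ and $\mathrm{VM}_{\beta}$, whose definition is not reproduced in this excerpt. The sketch below assumes the standard V-measure, $\mathrm{VM}_{\beta} = (1+\beta)\,h\,c / (\beta h + c)$, where $h$ is homogeneity and $c$ is completeness; the paper's exact convention may differ, and the function names are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) of a list of labels (natural log)."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) computed from two aligned label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    marginal = Counter(given)
    return -sum(c / n * log(c / marginal[g]) for (g, _), c in joint.items())


def v_measure(truth, predicted, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness.

    ASSUMPTION: standard V-measure; beta > 1 weights completeness more,
    beta < 1 weights homogeneity more.
    """
    h_truth, h_pred = entropy(truth), entropy(predicted)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, predicted) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, truth) / h_pred
    denominator = beta * homogeneity + completeness
    if denominator == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / denominator


# Toy example: four annotated users in two parties (hypothetical labels)
truth = ["A", "A", "B", "B"]
perfect = v_measure(truth, [0, 0, 1, 1])             # clusters match parties
over_split = [0, 1, 2, 3]                            # every user in its own cluster
vm_1 = v_measure(truth, over_split, beta=1.0)        # about 0.667
vm_half = v_measure(truth, over_split, beta=0.5)     # about 0.75
vm_2 = v_measure(truth, over_split, beta=2.0)        # about 0.6
```

Under this assumed definition, an over-split partition is perfectly homogeneous but only half complete, so its score rises as $\beta$ decreases. This illustrates why varying $\beta$ probes whether a method returns sub-groups or super-groups of the annotated communities.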