Domain-based user embedding for competing events on social media

Wentao Xu, Kazutoshi Sasahara

TL;DR

This work introduces a domain-based user embedding method that leverages URL domain co-occurrence in retweet behavior to characterize polarized user clusters involved in competing events on social media. By constructing a domain co-occurrence network and deriving per-user embeddings by summing domain vectors learned with Node2vec, the approach achieves higher classification accuracy and macro-F1 across topics than network- or content-based baselines, while also reducing computational cost. The method enables intuitive visualizations of user similarity and boundary delineation between opposing groups, providing a practical tool for studying echo chambers and polarization dynamics. Its robustness to data sparsity and potential for integration with language models suggest broad applicability to computational social science analyses of social divide and information diffusion.

Abstract

Social divide and polarization have become significant societal issues. To understand the mechanisms behind these phenomena, social media analysis offers research opportunities in computational social science, where developing effective user embedding methods is essential for subsequent analysis. Traditionally, researchers have used predefined network-based user features (e.g., network size, degree, and centrality measures). However, because such measures may not capture the complex characteristics of social media users, in our study we developed a method for embedding users based on a URL domain co-occurrence network. This approach effectively represents social media users involved in competing events such as political campaigns and public health crises. We assessed the method's performance using binary classification tasks and datasets that covered topics associated with the COVID-19 infodemic, such as QAnon, Biden, and Ivermectin, among Twitter users. Our results revealed that user embeddings generated directly from the retweet network and/or based on language performed below expectations, whereas our domain-based embeddings outperformed those methods while reducing computation time. Therefore, domain-based embedding offers an accessible and effective method for characterizing social media users in competing events.
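
To make the summing step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each domain already has a vector (in the paper these come from Node2vec; here they are hypothetical toy values) and that a user's embedding is the element-wise sum of the vectors of the domains that user shared. The domain names, the function `embed_user`, and the choice to count every occurrence of a repeated domain are assumptions made for illustration.

```python
# Hypothetical 3-dimensional domain vectors standing in for Node2vec output.
domain_vectors = {
    "example-news.com": [0.5, 0.0, 1.0],
    "example-blog.org": [1.0, 2.0, 0.0],
    "example-video.net": [0.0, 1.0, 1.0],
}


def embed_user(shared_domains, vectors):
    """Return the element-wise sum of the vectors of the domains a user shared.

    Domains without a learned vector are skipped (an assumption of this sketch).
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for domain in shared_domains:
        vec = vectors.get(domain)
        if vec is None:
            continue  # no vector available for this domain
        total = [t + v for t, v in zip(total, vec)]
    return total


# A user who shared two known domains and one unknown domain.
user_a = embed_user(
    ["example-news.com", "example-blog.org", "unknown.example"],
    domain_vectors,
)
print(user_a)  # [1.5, 2.0, 1.0]
```

Because the user vector depends only on which domains appear, users who share from similar sets of sources end up close together, which is the intuition behind using it to separate opposing groups.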

Paper Structure (16 sections, 2 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Retweet networks of three topics: (a) QAnon, in which green represents pro-QAnon users and magenta represents anti-QAnon users; (b) Biden, in which green represents Republicans and magenta represents Democrats; and (c) Ivermectin, in which green represents writers and mainstream news and magenta represents users diffusing misinformation about Ivermectin.
  • Figure 2: Example of a domain co-occurrence network. From a given list of domain co-occurrences, a bipartite graph is constructed (top) and subsequently projected onto a domain co-occurrence network. The figure is for demonstrative purposes only.
  • Figure 3: Architecture and procedures of the proposed model. A domain co-occurrence network is constructed with the mechanism exemplified in Figure 2. The Node2vec embedding for each user is calculated and input into the linear layer of a neural network, while the user representations are concatenated and sent into another linear layer. The dropout layer is used to reduce overfitting, and ReLU, used as an activation function, introduces nonlinearity. The final classification is obtained through the output layer.
  • Figure 4: Learning curves of selected topics with the domain-based user embedding for the topics of (a) QAnon, (b) Biden, and (c) Ivermectin.
  • Figure 5: t-SNE visualizations of domain-based and hashtag-based user embeddings: (a) and (b) for the QAnon topic; (c) and (d) for the Biden topic; (e) for the Ivermectin topic (the hashtag-based user embedding is not available, as explained in the text).
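
The projection step described in the Figure 2 caption can be illustrated with a short sketch. It assumes the second node type in the bipartite graph is users, and that an edge between two domains is weighted by the number of users who shared both; neither detail is confirmed by the caption, and the data and the function `project_to_domain_network` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bipartite data: each user maps to the set of domains they shared.
user_domains = {
    "user_1": {"a.example", "b.example"},
    "user_2": {"a.example", "b.example", "c.example"},
    "user_3": {"c.example"},
}


def project_to_domain_network(user_domains):
    """Project a user-domain bipartite graph onto a weighted domain network.

    Each pair of domains shared by the same user gets an edge; the weight
    counts how many users shared both domains.
    """
    edges = Counter()
    for domains in user_domains.values():
        # Sorting gives every undirected pair one canonical key.
        for pair in combinations(sorted(domains), 2):
            edges[pair] += 1
    return edges


edges = project_to_domain_network(user_domains)
for (left, right), weight in sorted(edges.items()):
    print(left, right, weight)
```

A weighted edge list of this kind is the sort of input a random-walk embedding method such as Node2vec would then consume to produce one vector per domain.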