Sampled Datasets Risk Substantial Bias in the Identification of Political Polarization on Social Media
Gabriele Di Bona, Emma Fraxanet, Björn Komander, Andrea Lo Sasso, Virginia Morini, Antoine Vendeville, Max Falkenberg, Alessandro Galeazzi
TL;DR
This study examines how sampling Twitter data affects the measurement of political polarization, using 24 hours of the Polish Twittersphere as a case study. It compares a full dataset against several sampling strategies and applies latent ideology estimation based on retweet interactions to quantify polarization, using metrics such as Hartigan's dip test and the Wasserstein distance. The results show that small samples often misrepresent polarization, with a critical threshold at around 40% of the data, beyond which multimodality becomes detectable and distributions stabilize. Keyword-based sampling can approximate the full distribution only if keywords are carefully chosen, whereas poorly chosen keywords introduce bias. The findings have practical implications for researchers and policymakers: they highlight why access to comprehensive data is essential for robust polarization analyses and inform the design of data-access guidelines under the EU Digital Services Act.
Abstract
Following recent policy changes by X (Twitter) and other social media platforms, user interaction data has become increasingly difficult to access. These restrictions are impeding robust research on social and political phenomena online, which is critical given the profound impact social media platforms may have on our societies. Here, we investigate the reliability of polarization measures obtained from different samples of social media data by studying the structural polarization of the Polish political debate on Twitter over a 24-hour period. First, we show that the political discussion on Twitter is only a small subset of the wider Twitter discussion. Second, we find that large samples can be representative of the whole political discussion on a platform, but small samples consistently fail to accurately reflect the true structure of polarization online. Finally, we demonstrate that keyword-based samples can be representative if keywords are selected with great care, but that poorly selected keywords can result in substantial political bias in the sampled data. Our findings demonstrate that it is not possible to measure polarization reliably with small, sampled datasets, highlighting why the current lack of research data is so problematic, and providing insight into the practical implementation of the European Union's Digital Services Act, which aims to improve researchers' access to social media data.
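As a rough illustration of how a sample can be compared against a full distribution, the sketch below computes the one-dimensional Wasserstein distance between random subsamples and a complete set of scores. It is a toy example under stated assumptions, not the authors' pipeline: the scores are synthetic (a two-peaked Gaussian mixture standing in for latent ideology scores), the scores themselves are subsampled rather than the underlying retweets, and the names `wasserstein_1d` and `synthetic_scores` are hypothetical.

```python
import random


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two empirical 1D distributions,
    computed as the integral of |F_x(t) - F_y(t)| over t."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Advance the counters so that i = #{x <= left} and j = #{y <= left}.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        # Both empirical CDFs are constant on [left, right).
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


def synthetic_scores(n, rng):
    """Assumption: synthetic two-peaked scores, NOT data from the study."""
    return [rng.gauss(-1.0 if rng.random() < 0.5 else 1.0, 0.3) for _ in range(n)]


rng = random.Random(0)
full = synthetic_scores(5000, rng)

# Illustrative sampling fractions; the values are arbitrary choices.
distances = {}
for fraction in (0.01, 0.1, 0.4, 0.8):
    sample = rng.sample(full, int(fraction * len(full)))
    distances[fraction] = wasserstein_1d(sample, full)
    print(f"sample fraction {fraction:.2f}: distance to full = {distances[fraction]:.4f}")
```

A smaller distance means the sampled distribution sits closer to the full one. Because this toy subsamples scores directly, it only illustrates the comparison metric; it does not reproduce the effect of sampling interactions before estimating ideology, which is what the study investigates.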
