Sampled Datasets Risk Substantial Bias in the Identification of Political Polarization on Social Media

Gabriele Di Bona; Emma Fraxanet; Björn Komander; Andrea Lo Sasso; Virginia Morini; Antoine Vendeville; Max Falkenberg; Alessandro Galeazzi

TL;DR

This study examines how sampling Twitter data affects the measurement of political polarization, using a 24-hour snapshot of the Polish Twittersphere as a case study. It compares the full dataset against several sampling strategies and applies latent ideology estimation based on retweet interactions to quantify polarization, using Hartigan's dip test and the Wasserstein distance as comparison metrics. The results show that small samples often misrepresent polarization: multimodality becomes detectable and the distributions stabilize only once roughly 40% of the data is available. Keyword-based sampling can approximate the full distribution only if the keywords are carefully chosen, whereas poorly chosen keywords introduce bias. The findings have practical implications for researchers and policymakers, highlighting why access to comprehensive data is essential for robust polarization analyses and informing the design of data-access guidelines under the EU Digital Services Act.

Abstract

Following recent policy changes by X (Twitter) and other social media platforms, user interaction data has become increasingly difficult to access. These restrictions are impeding robust research pertaining to social and political phenomena online, which is critical due to the profound impact social media platforms may have on our societies. Here, we investigate the reliability of polarization measures obtained from different samples of social media data by studying the structural polarization of the Polish political debate on Twitter over a 24-hour period. First, we show that the political discussion on Twitter is only a small subset of the wider Twitter discussion. Second, we find that large samples can be representative of the whole political discussion on a platform, but small samples consistently fail to accurately reflect the true structure of polarization online. Finally, we demonstrate that keyword-based samples can be representative if keywords are selected with great care, but that poorly selected keywords can result in substantial political bias in the sampled data. Our findings demonstrate that it is not possible to measure polarization in a reliable way with small, sampled datasets, highlighting why the current lack of research data is so problematic, and providing insight into the practical implementation of the European Union's Digital Services Act, which aims to improve researchers' access to social media data.
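To make the comparison between a sampled and a full ideology distribution concrete, the following is a minimal Python sketch of the first-order Wasserstein distance between two one-dimensional empirical samples. It is an illustration under stated assumptions, not the authors' pipeline: the scores are synthetic, the helper names (`wasserstein_1d`, `sample_distance`) are hypothetical, and the sample is drawn directly from the scores, whereas the paper samples retweets and then re-estimates the latent ideology.

```python
import random


def wasserstein_1d(xs, ys):
    """First-order Wasserstein distance between two 1-D empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Advance both pointers so i and j count the values <= left.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


def sample_distance(scores, fraction, rng):
    """Distance between a random sub-sample of `scores` and all of `scores`."""
    k = max(1, round(fraction * len(scores)))
    return wasserstein_1d(rng.sample(scores, k), scores)


# Synthetic bimodal "ideology" scores (an assumption, not data from the paper).
rng = random.Random(0)
scores = [rng.gauss(-1, 0.2) for _ in range(500)]
scores += [rng.gauss(1, 0.2) for _ in range(500)]

# Distance to the full distribution at a few hypothetical sampling fractions.
distances = {f: sample_distance(scores, f, rng) for f in (0.1, 0.4, 1.0)}
```

By construction the distance is zero when the sample contains every score, so it can serve as a measure of how far a partial sample drifts from the full distribution.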

Paper Structure (19 sections, 5 equations, 6 figures, 1 table)

Introduction
Related work
Materials and Methods
Dataset
Sampling techniques
Random sampling.
Keyword-based sampling.
Seed-based sampling.
Latent Ideology Estimation
Comparison metrics
Dip test and statistic.
Wasserstein distance.
Graph Measures
Results
The role of the political debate in Twitter discussion
...and 4 more sections

Figures (6)

Figure 1: Visualization of the Twitter co-occurrence network in Poland. Left: politicians only. Right: politicians and non-politicians. Blue circles are PiS politicians, while orange circles are PO politicians. The left panel displays the co-occurrence network of the political debate in Polish tweets and shows two clearly separated political clusters. The right panel shows the co-occurrence network of the entire debate in Polish tweets. Here the polarization observed in the left panel is less evident, as it is embedded in the broader context of the overall debate in Poland.
Figure 2: Comparison of ideology distributions obtained with different sampling strategies on the full 24-hour Poland dataset. Left: distributions for a) the full dataset with politicians as seeds for latent ideology estimation (blue), b) the dataset obtained through keyword- and seed-based sampling with politicians as seeds (orange), and c) the same strategy as b) but with all users as influencers in the latent ideology estimation (green). Right: basic statistics for each distribution. The plot indicates that a) and b) result in bimodal distributions. Given the sizes of the datasets and the strategies used, b) is a partial sample of a), and thus keyword-based sampling underrepresents one of the peaks. In contrast, c) shows that when seeds are not limited to politicians, a third peak of moderate users is captured, resulting in a less polarized distribution.
Figure 3: Effect of varying the percentage of accessed data, using the Polish 24h dataset with politicians as seeds for latent ideology retrieval. Left: Ideology distributions for various percentages of partial data. The bimodal distribution collapses for percentages below 30%. Right: Corresponding metrics for each percentage (the x-axis represents the number of retweets obtained). In order: Hartigan’s dip test with significance levels, Wasserstein distance to the 100% distribution, and relative size of the Largest Weakly Connected Component (LWCC). For the 20% and 30% samples, multimodality is not significant, and the bimodality statistic (D) increases as the data percentage rises. The Wasserstein distance drops sharply for the initial samples and stabilizes above 50% (around 2,000 retweets). The relative LWCC size indicates a dismantling of the retweet network for the first two sub-samples.
Figure 4: Effect of varying the percentage of top retweeted politicians used as seeds, using the Polish 24h dataset for latent ideology retrieval. Left: Ideology distributions for different percentages of seed politicians, revealing bimodality across all percentages but with variations below 40% of top politicians. Percentages lower than 3% do not produce viable results. Right: Corresponding metrics for each percentage, with the x-axis representing the number of retweets obtained at each percentage level. The metrics include Hartigan’s dip test with significance levels, Wasserstein distance to the 100% distribution, and the relative size of the Largest Weakly Connected Component (LWCC). For samples with fewer than 4,000 retweets, both the bimodality statistic (D) and the Wasserstein distance fluctuate. As the number of retweets increases to 4,000 and higher, the bimodality statistic stabilizes around 0.15, and the Wasserstein distance decreases to 0. The relative size of the LWCC indicates that the network is only disconnected when using the top 3% of influencers or fewer.
Figure 5: Keyword-based samples can substantially bias the inferred distribution of ideological opinions on Twitter. Left: Using a broad range of political terms can closely approximate the ideology distribution computed using the full dataset and politicians as seed nodes. Right: Poorly selected keyword samples can result in significant bias in the identified ideology distribution, resulting in the overrepresentation of either the political left (yellow) or the political right (blue).
...and 1 more figures
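The captions for Figures 3 and 4 report the relative size of the Largest Weakly Connected Component (LWCC) of the retweet network. The Python sketch below shows one way such a quantity could be computed with a union-find structure. It rests on assumptions that are not taken from the paper: the edge list is synthetic, the function name `lwcc_fraction` is hypothetical, and "relative size" is taken to mean the share of nodes present in the sampled edge list.

```python
import random


def lwcc_fraction(edges):
    """Share of nodes that lie in the largest weakly connected component.

    Edge direction is ignored, which is what "weakly" connected means.
    Only nodes that appear in `edges` are counted (an assumption).
    """
    nodes = {n for edge in edges for n in edge}
    if not nodes:
        return 0.0
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)

    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(nodes)


# Synthetic retweet edges (retweeter, retweeted); not data from the paper.
rng = random.Random(0)
edges = [(rng.randrange(200), rng.randrange(200)) for _ in range(400)]

# LWCC share for random edge samples at a few hypothetical fractions.
lwcc_by_fraction = {
    f: lwcc_fraction(rng.sample(edges, round(f * len(edges))))
    for f in (0.2, 0.5, 1.0)
}
```

A value well below 1 for a small sample would correspond to the fragmentation of the retweet network that the captions describe.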
