Comparing Methods for Creating a National Random Sample of Twitter Users
Meysam Alizadeh, Darya Zare, Zeynab Samei, Mohammadamin Alizadeh, Mael Kubli, Mohammadhadi Aliahmadi, Sarvenaz Ebrahimi, Fabrizio Gilardi
TL;DR
This study systematically compares four common methods for constructing a national random sample of Twitter users in the US, evaluating tweet-, user-, and population-level representativeness. Using a month-long data collection and a debiasing framework based on inclusion probabilities, the authors demonstrate that the 1% Stream method most effectively yields population-representative samples, with Bounding Box serving as a viable fallback when streaming is not feasible. Across extensive robustness checks, the 1% Stream method consistently achieves a lower population-inference error, measured as mean absolute percentage error (MAPE), than the other methods, even after accounting for demographic correlations. The work provides practical guidance for researchers conducting population-level Twitter analyses and highlights tradeoffs related to timeliness, engagement metrics, and regional biases. Its approach and findings can inform similar sampling and debiasing efforts on other social platforms and domains.
Abstract
Twitter data has been widely used by researchers across various social and computer science disciplines. A common aim when working with Twitter data is the construction of a random sample of users from a given country. However, while several methods have been proposed in the literature, their comparative performance is largely unexplored. In this paper, we implement four common methods to collect a random sample of Twitter users in the US: 1% Stream, Bounding Box, Location Query, and Language Query. We then compare the methods on tweet- and user-level metrics, as well as on their accuracy in estimating the US population with and without using the inclusion probabilities of various demographic groups. Our results show that the 1% Stream method performs differently from the other methods on tweet- and user-level metrics, and performs best for constructing a population-representative sample. We discuss the conditions under which the 1% Stream method may not be suitable and suggest the Bounding Box method as the second-best alternative.
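To make the population-estimation step concrete, the following is a minimal sketch of how inclusion probabilities might be used to debias a sample, and how the resulting estimates could be scored against known population figures. It is not the authors' implementation: it assumes a simple Horvitz-Thompson-style weighting in which each sampled user from a demographic group stands for 1 / p users in the population, and the function names, region names, age groups, probabilities, and counts are all hypothetical values chosen for illustration.

```python
def debiased_estimate(sample_counts, inclusion_probs):
    """Estimate a population total from per-group sample counts.

    Assumption (not from the paper): inverse inclusion-probability weighting,
    so each sampled user in group g represents 1 / p_g people.
    """
    return sum(n / inclusion_probs[g] for g, n in sample_counts.items())


def mape(estimates, truths):
    """Mean absolute percentage error across regions, in percent."""
    errors = [abs(estimates[r] - truths[r]) / truths[r] for r in truths]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical toy data: sampled users per region and age group.
sample_counts = {
    "region_a": {"18-29": 40, "30-49": 30},
    "region_b": {"18-29": 10, "30-49": 45},
}

# Hypothetical probability that a person in each age group ends up in the sample.
inclusion_probs = {"18-29": 0.0004, "30-49": 0.0003}

# Hypothetical ground-truth populations (e.g. from a census).
true_population = {"region_a": 210_000, "region_b": 160_000}

estimates = {
    region: debiased_estimate(counts, inclusion_probs)
    for region, counts in sample_counts.items()
}

if __name__ == "__main__":
    for region, value in estimates.items():
        print(f"{region}: estimated {value:,.0f}, true {true_population[region]:,}")
    print(f"MAPE: {mape(estimates, true_population):.2f}%")
```

Under this kind of scheme, a sampling method whose weighted estimates track the reference populations more closely yields a lower MAPE, which is the sense in which the methods are ranked on population-level accuracy.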
