Extracting Participation in Collective Action from Social Media
Arianna Pera, Luca Maria Aiello
TL;DR
This work tackles the scarcity of granular ground-truth data on participation in online collective action by introducing a suite of open-source, topic-agnostic text classifiers trained on crowdsourced Reddit annotations. It evaluates a spectrum of models, from dictionary and centroid baselines to BERT and Llama3-based models, via a two-stage process (binary detection of participation, then multi-class classification of its level) and uses data augmentation to overcome label scarcity. The results show that small, domain-tuned models can closely match larger LLMs in binary detection and remain competitive in the more nuanced multi-class classification, at significantly lower compute cost. Applied to climate-action discussions and broader Reddit communities, the method reveals participation patterns that diverge from keyword-based proxies and align with sociopolitical demographics, offering a robust tool for computational social science and online-mobilization studies.
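The two-stage process mentioned above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function `classify_two_stage` and the keyword-based stand-ins `toy_is_participation` and `toy_participation_level` are hypothetical names introduced here, and the keyword cues are invented purely so the example runs without any trained model. In practice, each stage would be a trained classifier such as the BERT or Llama3 models described in the paper.

```python
from typing import Callable, Optional

# The four participation levels named in the abstract.
LEVELS = (
    "recognizing collective issues",
    "engaging in calls-to-action",
    "expressing intention of action",
    "reporting active involvement",
)


def classify_two_stage(
    text: str,
    is_participation: Callable[[str], bool],
    participation_level: Callable[[str], str],
) -> Optional[str]:
    """Stage 1: binary participation check. Stage 2: level, only if stage 1 fires."""
    if not is_participation(text):
        return None  # no expression of participation detected
    level = participation_level(text)
    if level not in LEVELS:
        raise ValueError(f"unexpected level: {level!r}")
    return level


# Hypothetical keyword stand-ins for the two trained classifiers (assumption:
# these cues are illustrative only and are not taken from the paper).
def toy_is_participation(text: str) -> bool:
    lowered = text.lower()
    cues = ("we need to", "join", "i will", "i attended")
    return any(cue in lowered for cue in cues)


def toy_participation_level(text: str) -> str:
    lowered = text.lower()
    if "i attended" in lowered:
        return LEVELS[3]
    if "i will" in lowered:
        return LEVELS[2]
    if "join" in lowered:
        return LEVELS[1]
    return LEVELS[0]


if __name__ == "__main__":
    for comment in ("Join us at the rally", "Nice weather today"):
        print(comment, "->", classify_two_stage(
            comment, toy_is_participation, toy_participation_level))
```

Separating the stages means the cheaper binary filter runs on every comment, while the finer-grained level classifier only sees comments already flagged as participation.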
Abstract
Social media play a key role in mobilizing collective action and hold potential for studying the pathways that lead individuals to actively engage in addressing global challenges. However, quantitative research in this area has been limited by the absence of granular, large-scale ground truth about the level of participation in collective action among individual social media users. To address this limitation, we present a novel suite of text classifiers designed to identify expressions of participation in collective action from social media posts in a topic-agnostic fashion. Grounded in the theoretical framework of social movement mobilization, our classification captures participation and categorizes it into four levels: recognizing collective issues, engaging in calls-to-action, expressing intention of action, and reporting active involvement. We constructed a labeled training dataset of Reddit comments through crowdsourcing, which we used to train BERT classifiers and fine-tune Llama3 models. Our findings show that smaller language models can reliably detect expressions of participation (weighted F1=0.71) and rival larger models in capturing nuanced levels of participation. By applying our methodology to Reddit, we illustrate its effectiveness as a robust tool for characterizing online communities in ways that differ from topic modeling, stance detection, and keyword-based methods. Our framework contributes to Computational Social Science research by providing a new source of reliable annotations useful for investigating the social dynamics of collective action.
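For readers unfamiliar with the weighted F1 metric reported above, the sketch below shows how it is commonly computed: the per-class F1 scores are averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the toy labels are assumptions made for illustration; they are not data or code from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average per-class F1, weighting each class by its support in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


if __name__ == "__main__":
    # Toy labels (1 = participation, 0 = no participation); illustrative only.
    y_true = [1, 1, 1, 0]
    y_pred = [1, 1, 0, 0]
    print(round(weighted_f1(y_true, y_pred), 4))
```

Because classes are weighted by support, frequent classes dominate the score, which is worth keeping in mind when the label distribution is imbalanced.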
