Extracting Participation in Collective Action from Social Media

Arianna Pera, Luca Maria Aiello

TL;DR

This work tackles the scarcity of granular ground-truth data on participation in online collective action by introducing a suite of open-source, topic-agnostic text classifiers trained on crowdsourced Reddit annotations. It evaluates a spectrum of models, from dictionary and centroid baselines to BERT classifiers and fine-tuned Llama3 models, in a two-stage process (binary detection of participation, then multi-class classification of its level) and uses data augmentation to compensate for label scarcity. The results indicate that small, domain-tuned models can closely match larger LLMs in binary detection and remain competitive in nuanced multi-class classification, at significantly lower compute cost. Applied to climate-action discussions and to broader Reddit communities, the method reveals participation patterns that diverge from keyword-based proxies and align with sociopolitical demographics, offering a robust tool for computational social science and online-mobilization studies.
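
The two-stage process described above can be pictured as a simple cascade: a binary model decides whether a text expresses participation at all, and only accepted texts are passed to a second model that assigns a level. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the level label strings, and the keyword-based stand-in models are all hypothetical placeholders; in practice the two callables would wrap trained classifiers.

```python
from typing import Callable, Optional

# Hypothetical type aliases: any callable with these signatures can be plugged in.
BinaryModel = Callable[[str], bool]   # stage 1: does the text express participation?
LevelModel = Callable[[str], str]     # stage 2: which participation level?


def classify(text: str, is_participation: BinaryModel, level_of: LevelModel) -> Optional[str]:
    """Run the cascade: return a level label, or None if stage 1 rejects the text."""
    if not is_participation(text):
        return None
    return level_of(text)


# Toy stand-ins so the sketch runs end to end. They are NOT the trained models
# from the paper; they only mimic the interface with naive string checks.
def toy_binary(text: str) -> bool:
    return "we" in text.lower().split()


def toy_level(text: str) -> str:
    lowered = text.lower()
    if "i will" in lowered:
        return "intention"            # hypothetical label name
    if "i joined" in lowered:
        return "active_involvement"   # hypothetical label name
    if "join us" in lowered:
        return "call_to_action"       # hypothetical label name
    return "recognizing_issue"        # hypothetical label name


if __name__ == "__main__":
    print(classify("I will join the march, we need change", toy_binary, toy_level))
    print(classify("Nice weather today", toy_binary, toy_level))
```

Keeping the two stages as separate callables means a cheap binary filter can screen a large corpus before a more expensive level classifier runs on the much smaller set of accepted texts.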

Abstract

Social media play a key role in mobilizing collective action, holding the potential for studying the pathways that lead individuals to actively engage in addressing global challenges. However, quantitative research in this area has been limited by the absence of granular and large-scale ground truth about the level of participation in collective action among individual social media users. To address this limitation, we present a novel suite of text classifiers designed to identify expressions of participation in collective action from social media posts, in a topic-agnostic fashion. Grounded in the theoretical framework of social movement mobilization, our classification captures participation and categorizes it into four levels: recognizing collective issues, engaging in calls-to-action, expressing intention of action, and reporting active involvement. We constructed a labeled training dataset of Reddit comments through crowdsourcing, which we used to train BERT classifiers and fine-tune Llama3 models. Our findings show that smaller language models can reliably detect expressions of participation (weighted F1=0.71), and rival larger models in capturing nuanced levels of participation. By applying our methodology to Reddit, we illustrate its effectiveness as a robust tool for characterizing online communities in innovative ways compared to topic modeling, stance detection, and keyword-based methods. Our framework contributes to Computational Social Science research by providing a new source of reliable annotations useful for investigating the social dynamics of collective action.
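
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of a weighted F1 score under its usual definition: the F1 of each class is computed separately and then averaged with weights proportional to the number of true examples of that class. The source does not show its evaluation code, so the function name and the toy labels here are assumptions made for illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights equal to each class's share of true labels."""
    support = Counter(y_true)          # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total   # weight by class support
    return score


if __name__ == "__main__":
    # Toy binary labels (1 = participation, 0 = none); not data from the paper.
    y_true = [1, 1, 1, 0]
    y_pred = [1, 1, 0, 0]
    print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support keeps the score from being dominated by a rare class, which matters when one label (for example, "no participation") is far more common than the others.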

Paper Structure (46 sections, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Validation of the proposed approach. (a-d) comparison with topic modeling; (e) comparison with stance detection -- A for agree, D for disagree, N for neutral; (f) impact of collective action keywords on classification performance.
  • Figure 2: Comparison of climate change-focused comment percentages and participation in collective action levels across subreddits, x-axis values normalized.
  • Figure 3: Fraction of Reddit comments showing participation in collective action across subreddits with varying socio-political tendencies. The socio-political scores are divided into equally sized bins based on sample quantiles.
  • Figure A1: Example screenshot of the annotation task on MTurk.
  • Figure B2: Complete visual representation of the topic modeling application, resulting in 9 topics (plus noise).
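
The caption of Figure 3 mentions dividing socio-political scores into equally sized bins based on sample quantiles. A minimal sketch of that kind of binning is shown below, using only the Python standard library. The function name, the choice of the "inclusive" quantile method, and the example scores are assumptions for illustration; the source does not specify how the bins were computed.

```python
from bisect import bisect_right
from statistics import quantiles
from typing import List, Sequence


def quantile_bins(scores: Sequence[float], n_bins: int) -> List[int]:
    """Assign each score a bin index in [0, n_bins - 1] using sample quantiles as cut points."""
    # quantiles() returns n_bins - 1 cut points that split the sample into
    # groups of (approximately) equal size.
    cuts = quantiles(scores, n=n_bins, method="inclusive")
    # bisect_right counts how many cut points are <= the score, which is the bin index.
    return [bisect_right(cuts, s) for s in scores]


if __name__ == "__main__":
    # Hypothetical per-subreddit scores; not values from the paper.
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    print(quantile_bins(scores, n_bins=4))
```

Quantile-based cut points give each bin roughly the same number of subreddits, so a per-bin average is not driven by a handful of communities at the extremes of the score range.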