Curating corpora with classifiers: A case study of clean energy sentiment online
Michael V. Arnold, Peter Sheridan Dodds, Christopher M. Danforth
TL;DR
The paper addresses the challenge of defining corpus boundaries when social media data are used to track public opinion in real time. It proposes a two-step pipeline: broad keyword queries to maximize recall, followed by fine-tuned transformer-based relevance classifiers to maximize precision. Using MPNet-based sentence embeddings, UMAP visualization, and multiple diagnostic tools, the authors report strong relevance discrimination (F1 up to 0.95) across solar, wind, and nuclear energy discourse, which supports more robust ambient sentiment analyses and lexical comparisons. Core contributions include a detailed methodology for dataset construction, a comparative evaluation of several embedding-based classifiers, and the application of word-shift and allotaxonometry techniques to interpret linguistic differences between relevant and non-relevant tweet subsets. The approach offers a practical, scalable preprocessing step for large social media datasets with uncertain corpus boundaries, with potential applicability beyond energy topics.
Abstract
Well-curated, large-scale corpora of social media posts containing broad public opinion offer an alternative data source to complement traditional surveys. While surveys are effective at collecting representative samples and are capable of achieving high accuracy, they can be expensive to run and can lag public opinion by days or weeks. Both of these drawbacks could be overcome with a real-time, high-volume data stream and a fast analysis pipeline. A central challenge in orchestrating such a data pipeline is devising an effective method for rapidly selecting the best corpus of relevant documents for analysis. Querying with keywords alone often returns irrelevant documents that are not easily disambiguated with bag-of-words natural language processing methods. Here, we explore methods of corpus curation that filter irrelevant tweets using pre-trained transformer-based models, fine-tuned for our binary classification task on hand-labeled tweets. We are able to achieve F1 scores of up to 0.95. The low cost and high performance of fine-tuning such a model suggest that our approach could be of broad benefit as a pre-processing step for social media datasets with uncertain corpus boundaries.
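As a rough illustration of the two-step idea, the sketch below applies a broad keyword match and then a relevance filter, and scores the filter with F1 on the keyword-matched posts. It is a minimal sketch under stated assumptions, not the authors' implementation: the keyword list, the example posts and labels, and `toy_classifier` (a crude cue-word rule standing in for a fine-tuned transformer) are all hypothetical.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical broad query terms; the actual keyword list is not given in this section.
KEYWORDS = ("solar", "wind", "nuclear")


def keyword_match(text: str, keywords: Iterable[str] = KEYWORDS) -> bool:
    """Step 1: recall-oriented filter using case-insensitive substring matching."""
    lowered = text.lower()
    return any(k in lowered for k in keywords)


def curate(posts: Sequence[str], is_relevant: Callable[[str], bool]) -> List[str]:
    """Step 2: keep keyword matches that the relevance classifier also accepts."""
    return [p for p in posts if keyword_match(p) and is_relevant(p)]


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for the positive (relevant) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def toy_classifier(text: str) -> bool:
    """Placeholder for a fine-tuned model: flags posts containing energy cue words."""
    cues = ("energy", "power", "turbine", "panel", "reactor")
    lowered = text.lower()
    return any(c in lowered for c in cues)


# Made-up posts with hand-assigned relevance labels (True = about clean energy).
posts = [
    "New solar panels went up on the school roof",
    "Wind turbines now power half the county",
    "The solar eclipse was stunning tonight",
    "So much wind today my umbrella broke",
    "Nuclear family sitcoms are my comfort watch",
    "Is nuclear the cleanest option we have?",
    "Great pasta recipe for dinner",
]
labels = [True, True, False, False, False, True, False]

# Evaluate the classifier only on posts that survive the keyword step.
matched = [(p, y) for p, y in zip(posts, labels) if keyword_match(p)]
y_true = [y for _, y in matched]
y_pred = [toy_classifier(p) for p, _ in matched]

curated = curate(posts, toy_classifier)
score = f1_score(y_true, y_pred)
print(f"kept {len(curated)} of {len(matched)} keyword matches, F1 = {score:.2f}")
```

The keyword step deliberately admits ambiguous posts (an eclipse, a windy day), leaving the classifier to restore precision. The cue-word rule misses one relevant post here, which is the kind of error a model fine-tuned on hand-labeled tweets is meant to reduce.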
