Curating corpora with classifiers: A case study of clean energy sentiment online

Michael V. Arnold, Peter Sheridan Dodds, Christopher M. Danforth

TL;DR

The paper addresses the challenge of drawing corpus boundaries in social media data used to track public opinion in real time. It proposes a two-step pipeline: broad keyword queries to maximize recall, followed by fine-tuned transformer-based relevance classifiers to maximize precision. Using MPNet-based sentence embeddings, UMAP visualization, and several diagnostic tools, the authors report strong relevance discrimination (F1 up to 0.95) across solar, wind, and nuclear energy discourse, which supports more reliable ambient sentiment analyses and lexical comparisons. Core contributions include a detailed methodology for dataset construction, a comparative evaluation of several embedding-based classifiers, and the application of word-shift and allotaxonometry techniques to interpret linguistic differences between relevant and non-relevant tweet subsets. The approach offers a practical, scalable preprocessing step for large social media datasets with uncertain corpus boundaries, and may apply to topics beyond energy.

Abstract

Well-curated, large-scale corpora of social media posts containing broad public opinion offer an alternative data source to complement traditional surveys. While surveys are effective at collecting representative samples and are capable of achieving high accuracy, they can be both expensive to run and lag public opinion by days or weeks. Both of these drawbacks could be overcome with a real-time, high-volume data stream and a fast analysis pipeline. A central challenge in orchestrating such a data pipeline is devising an effective method for rapidly selecting the best corpus of relevant documents for analysis. Querying with keywords alone often includes irrelevant documents that are not easily disambiguated with bag-of-words natural language processing methods. Here, we explore methods of corpus curation to filter irrelevant tweets using pre-trained transformer-based models, fine-tuned for our binary classification task on hand-labeled tweets. We are able to achieve F1 scores of up to 0.95. The low cost and high performance of fine-tuning such a model suggest that our approach could be of broad benefit as a pre-processing step for social media datasets with uncertain corpus boundaries.
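
The two-step curation described in the abstract (a broad keyword query, then a relevance classifier, scored with F1) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy tweets, and `toy_classifier` are all hypothetical, and the keyword-based stand-in classifier only marks the place where the paper's fine-tuned transformer model would be called.

```python
from typing import Callable, List, Sequence


def keyword_query(tweets: Sequence[str], keyword: str) -> List[str]:
    """Step 1: broad, high-recall match on a keyword (case-insensitive)."""
    kw = keyword.lower()
    return [t for t in tweets if kw in t.lower()]


def curate(tweets: Sequence[str], keyword: str,
           is_relevant: Callable[[str], bool]) -> List[str]:
    """Step 2: keep only the keyword matches that the classifier accepts."""
    return [t for t in keyword_query(tweets, keyword) if is_relevant(t)]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for binary labels (1 = relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def toy_classifier(text: str) -> bool:
    """Hypothetical stand-in for a fine-tuned relevance model (assumption)."""
    cues = ("energy", "power", "panel", "turbine")
    return any(cue in text.lower() for cue in cues)


# Invented example tweets, for illustration only.
tweets = [
    "New solar panels installed on our roof today",
    "Solar radiation and humidity readings for Tuesday",
    "Wind turbines now power the whole island",
    "Strong wind knocked over my fence",
]

solar_relevant = curate(tweets, "solar", toy_classifier)

# Hand labels vs. predictions: tp = 2, fp = 1, fn = 1, so F1 = 2/3.
example_f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Keeping the classifier as a plain callable means a real model can replace `toy_classifier` without changing the query or scoring steps.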

Paper Structure (14 sections, 2 equations, 6 figures, 2 tables)

This paper contains 14 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Embedded tweet distribution plot for the combined datasets. Using a pre-trained MPNet-based model that produces semantically meaningful sentence embeddings, we plot the distribution of tweets within this semantic space. In both panels, points are tweets projected into 2D using UMAP for dimensionality reduction (McInnes et al., 2018). In panel A, we perform density-based hierarchical clustering using HDBSCAN and color by cluster. In panel B, we color by both the keyword used to query and the classification as relevant or non-relevant to the topic of clean energy. Relevant tweets containing the keywords 'wind', 'solar', and, to a lesser extent, 'nuclear' are relatively close together on the right of the embedding space, while non-relevant tweets are more dispersed.
  • Figure 2: Ambient sentiment time series comparison for relevant (R), non-relevant (NR), and combined tweet corpora, all containing the keyword 'solar'. In the top panel, we show the number of tokens with LabMT sentiment scores (Dodds et al., 2015) in each corpus on each day. 'Relevant' tweets, in blue, have more scored tokens early on, but the number of tokens in 'non-relevant' tweets increases in relative proportion over time. The center panel shows the average sentiment for each corpus, including a measurement of English language tweets as a whole in gray for comparison. Before 2019, the measured sentiment of the two corpora is comparable, but subsequently the mean sentiment of 'non-relevant' tweets drops. In the bottom panel we plot the standard deviation of the sentiment measurement, which captures a broader distribution of sentiment scores for 'non-relevant' tweets. Without classification filtering, the ambient sentiment measurement would be entirely misleading: it would appear that the sentiment of tweets containing the word 'solar' dropped dramatically in 2019, when in fact the sentiment of relevant tweets declined only modestly.
  • Figure 3: Ambient sentiment time series comparison for relevant (R), non-relevant (NR), and combined tweet corpora, all containing the keyword 'wind'. In the top panel, we show the number of tokens with LabMT sentiment scores (Dodds et al., 2015) for each corpus during each two-week period. R tweets, in blue, have more than an order of magnitude fewer tokens per time window over the entire study period. The center panel shows the average sentiment for each corpus, including a measurement of English language tweets as a whole in gray for comparison. R 'wind' tweets are more positive than Twitter on average early on, but this difference is reduced over time. Because most 'wind' tweets are non-relevant, the sentiment of the combined corpus closely follows the NR sentiment. In the bottom panel we plot the standard deviation of the sentiment measurement, which captures a broader distribution of sentiment scores for NR tweets, as was the case for all case studies we examined. Without classification filtering, the ambient sentiment measurement would have been dominated by NR tweets.
  • Figure 4: Ambient sentiment time series comparison for relevant (R), non-relevant (NR), and combined tweet corpora, all containing the keyword 'nuclear'. In the top panel, we show the number of tokens with LabMT sentiment scores (Dodds et al., 2015) for each corpus in each two-week period. The number of relevant n-grams, in blue, is consistently lower than the number of non-relevant n-grams. The center panel shows the average sentiment for each corpus, including a measurement of English language tweets as a whole in gray. We found that R tweets had higher sentiment than NR tweets containing 'nuclear', but much lower sentiment than Twitter as a whole. Sentiment appears relatively stable for both corpora, with periods of higher sentiment around 2017 and 2020-2022 for the R corpus. In the bottom panel, we plot the standard deviation of the sentiment measurement, which shows a broader distribution of sentiment scores for NR tweets, as well as sentiment for both corpora trending down slightly.
  • Figure 5: Sentiment shift plots comparing the classified relevant (R) and non-relevant (NR) tweet corpora for tweets containing the keywords 'solar', 'wind', and 'nuclear'. We show the top 20 words contributing to the difference in LabMT sentiment between the corpora. A. Relevant tweets that are related to clean energy are more positive on average for all keywords when compared to non-relevant tweets. Sad words that are less common in relevant 'solar' tweets are 'radiation', 'pressure', and 'humidity', which largely refer to the weather. Happy words like 'energy' and 'power' are more common in relevant tweets than in tweets not relevant to solar energy. B. For 'wind', relatively sad terms like 'humidity' and 'pressure' are less common in relevant tweets (these appear in clearly unrelated tweets about the weather), while happy terms like 'energy', 'power', and 'solar' are more common in tweets relevant to wind as a renewable energy source. C. For 'nuclear', relevant tweets are on average more positive because sad words like 'war', 'weapons', and 'bomb' are less common in relevant tweets, while happy words like 'power' and 'energy' are more common. The two prominent sad words 'nuclear' and 'waste' go against the positive difference in moving from non-relevant to relevant tweets, as they both occur more frequently in relevant tweets.
  • ...and 1 more figure
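
The ambient sentiment measurements in Figures 2 to 4 and the per-word contributions in Figure 5 can be illustrated with a small sketch. It assumes the usual dictionary-based approach: average the happiness scores of scored tokens weighted by frequency, then attribute the difference between two corpora to individual words as (score minus reference mean) times (change in relative frequency). The scores in `TOY_SCORES` are invented placeholders, not actual LabMT values, and the whitespace tokenizer, the token-weighted standard deviation, and the function names are simplifying assumptions rather than the authors' code.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Iterable, Tuple

# Hypothetical happiness scores on a 1-9 scale; NOT actual LabMT values.
TOY_SCORES: Dict[str, float] = {
    "energy": 7.0, "power": 6.5, "radiation": 3.0, "humidity": 4.0, "war": 2.0,
}


def scored_counts(texts: Iterable[str], scores: Dict[str, float]) -> Counter:
    """Count only tokens that have a sentiment score (whitespace tokenizer)."""
    counts: Counter = Counter()
    for text in texts:
        for token in text.lower().split():
            if token in scores:
                counts[token] += 1
    return counts


def ambient_sentiment(counts: Counter,
                      scores: Dict[str, float]) -> Tuple[float, float]:
    """Frequency-weighted mean and standard deviation of token scores."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus contains no scored tokens")
    mean = sum(scores[w] * n for w, n in counts.items()) / total
    var = sum(n * (scores[w] - mean) ** 2 for w, n in counts.items()) / total
    return mean, sqrt(var)


def word_shift(ref: Counter, comp: Counter,
               scores: Dict[str, float]) -> Dict[str, float]:
    """Per-word contribution to mean(comp) - mean(ref)."""
    ref_total, comp_total = sum(ref.values()), sum(comp.values())
    ref_mean, _ = ambient_sentiment(ref, scores)
    shift = {}
    for word in set(ref) | set(comp):
        # Counter returns 0 for words absent from a corpus.
        dp = comp[word] / comp_total - ref[word] / ref_total
        shift[word] = (scores[word] - ref_mean) * dp
    return shift


# Invented example corpora, for illustration only.
non_relevant = ["solar radiation humidity report", "radiation humidity power"]
relevant = ["solar energy power", "clean energy power energy"]

nr_counts = scored_counts(non_relevant, TOY_SCORES)
r_counts = scored_counts(relevant, TOY_SCORES)
nr_mean, nr_std = ambient_sentiment(nr_counts, TOY_SCORES)
r_mean, r_std = ambient_sentiment(r_counts, TOY_SCORES)

# Reference = non-relevant, comparison = relevant, as in Figure 5.
contributions = word_shift(nr_counts, r_counts, TOY_SCORES)
top_words = sorted(contributions, key=lambda w: abs(contributions[w]), reverse=True)
```

In this toy example the contributions sum to the difference between the two corpus means, and a sad word such as 'radiation' that is rarer in the relevant corpus contributes positively, matching the reading of Figure 5.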