SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets

Qiang Yang, Xiuying Chen, Changsheng Ma, Rui Yin, Xin Gao, Xiangliang Zhang

TL;DR

SenWave delivers a large-scale, fine-grained, multilingual sentiment analysis dataset tailored for COVID-19 Twitter discourse. It combines 10k English and 10k Arabic annotations across ten labels with 105 million unlabeled tweets in five languages, augmented by translations to Spanish, French, and Italian, and evaluated using transformer-based multi-label classifiers. The work demonstrates robust annotation quality, Transformer superiority over baselines, and the utility of ChatGPT for zero-shot and few-shot evaluation, revealing nuanced temporal and cross-cultural sentiment dynamics. This resource enables nuanced crisis analytics for researchers and policymakers and provides a platform for cross-language sentiment research in complex events.

Abstract

The global impact of the COVID-19 pandemic has highlighted the need for a comprehensive understanding of public sentiment and reactions. Despite the availability of numerous public datasets on COVID-19, some reaching volumes of up to 100 billion data points, challenges persist regarding the availability of labeled data and the presence of coarse-grained or inappropriate sentiment labels. In this paper, we introduce SenWave, a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets, featuring ten sentiment categories across five languages. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Additionally, it includes over 105 million unlabeled tweets collected during various COVID-19 waves. To enable accurate fine-grained sentiment classification, we fine-tuned pre-trained transformer-based language models using the labeled tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time. Furthermore, we assess the compatibility of our dataset with ChatGPT, demonstrating its robustness and versatility in various applications. Our dataset and accompanying code are publicly accessible in the repository at https://github.com/gitdevqiang/SenWave. We anticipate that this work will foster further exploration into fine-grained sentiment analysis for complex events within the NLP community, promoting more nuanced understanding and research innovations.

Paper Structure

This paper contains 36 sections, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Random Examples of Labeled Tweets
  • Figure 2: Heatmaps of label co-occurrence for English and Arabic tweets.
  • Figure 3: The absolute daily volume of COVID-19 Tweets collected in 5 languages, English (En), Spanish (Es), Arabic (Ar), French (Fr), and Italian (It). The vertical lines show Sundays, for guidance.
  • Figure 4: Sentiment variation of English tweets over time. The linear regression line of each emotion curve shows the trend of the emotion variation.
  • Figure 5: Sentiment variation in the USA over time. Each bar shows the distribution of sentiments on one day (zoom in to see the spikes).
  • ...and 8 more figures