Word frequency and sentiment analysis of twitter messages during Coronavirus pandemic
Nikhil Kumar Rajput, Bhavya Ahuja Grover, Vipin Kumar Rathi, Riya Bansal
TL;DR
This study analyzes COVID-19–related Twitter discourse through two quantitative lenses: word-frequency analysis modeled by a power-law distribution and sentiment analysis via TextBlob. It demonstrates that unigrams, bigrams, and trigrams follow power-law patterns with high goodness-of-fit, while revealing a predominantly neutral public sentiment (approximately $90.97\%$) and smaller shares of positive ($6.45\%$) and negative ($2.57\%$) tweets. Using a COVID-19–focused Twitter dataset sourced from Kaggle and processed with NLP tools (NLTK, WordNetLemmatizer), the work outlines a full preprocessing and visualization pipeline. The findings shed light on how public discourse around the pandemic evolved on Twitter and provide a methodological baseline for rapid sentiment and frequency analyses in social media data.
Abstract
The COVID-19 epidemic has had a great impact on social media conversation, especially on sites like Twitter, which has emerged as a hub for public reaction and information sharing. This paper analyzes a large dataset of Twitter messages related to the disease, posted from January 2020 onward. Two approaches are used: a statistical analysis of word frequencies and a sentiment analysis to gauge user attitudes. Word frequencies are computed for unigrams, bigrams, and trigrams and fitted with a power-law distribution. The fit is assessed with the Sum of Squared Errors (SSE), R-squared ($R^2$), and Root Mean Squared Error (RMSE); high $R^2$ and low SSE and RMSE values indicate that the model fits well. Sentiment analysis is conducted to understand the general emotional tone of Twitter users' messages. The results reveal that a majority of tweets exhibit neutral sentiment polarity, with only $2.57\%$ expressing negative polarity.
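To make the frequency-fitting step concrete, the following is a minimal sketch of how n-gram counts could be fitted with a power law $f(r) = a\,r^{b}$ over frequency rank $r$ and then scored with SSE, $R^2$, and RMSE. It is an illustration under stated assumptions, not the authors' implementation: the function names (`ngram_counts`, `fit_power_law`, `goodness_of_fit`) are hypothetical, the fit uses ordinary least squares in log-log space as a simple stand-in for whatever fitting routine the study used, and the toy token list is made up for demonstration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def fit_power_law(freqs):
    """Fit f(r) = a * r**b to rank-ordered frequencies.

    Assumption: least squares on (log rank, log frequency), which is a
    common shortcut and may differ from the fitting method in the paper.
    """
    ys = sorted(freqs, reverse=True)
    log_r = [math.log(r) for r in range(1, len(ys) + 1)]
    log_f = [math.log(y) for y in ys]
    n = len(ys)
    mean_r = sum(log_r) / n
    mean_f = sum(log_f) / n
    sxx = sum((x - mean_r) ** 2 for x in log_r)
    sxy = sum((x - mean_r) * (y - mean_f) for x, y in zip(log_r, log_f))
    b = sxy / sxx
    a = math.exp(mean_f - b * mean_r)
    return a, b


def goodness_of_fit(freqs, a, b):
    """Return SSE, R^2, and RMSE of the fitted curve on the original scale."""
    ys = sorted(freqs, reverse=True)
    preds = [a * r ** b for r in range(1, len(ys) + 1)]
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - sse / sst if sst else float("nan")
    rmse = math.sqrt(sse / len(ys))
    return {"sse": sse, "r2": r2, "rmse": rmse}


if __name__ == "__main__":
    # Toy, made-up tokens standing in for preprocessed tweet text.
    tokens = "stay home stay safe stay home wash hands stay safe".split()
    unigram_freqs = list(ngram_counts(tokens, 1).values())
    a, b = fit_power_law(unigram_freqs)
    print(a, b, goodness_of_fit(unigram_freqs, a, b))
```

The same three functions apply unchanged to bigrams and trigrams by passing `n=2` or `n=3` to `ngram_counts`. Fitting in log-log space weights low-frequency ranks more heavily than a direct nonlinear fit would, so the resulting metrics should be read as indicative only.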
