COVIDHealth: A Benchmark Twitter Dataset and Machine Learning based Web Application for Classifying COVID-19 Discussions

Mahathir Mohammad Bishal, Md. Rakibul Hassan Chowdory, Anik Das, Muhammad Ashad Kabir

TL;DR

This work introduces COVIDHealth, a labeled Twitter dataset of 6,667 COVID-19-related tweets categorized into five classes (health risks, prevention, symptoms, transmission, and treatment), together with a CNN-based classifier evaluated against traditional ML methods. It combines data collection from the CORONAVIRUS Tweets Dataset, careful annotation, and preprocessing with both LIWC/POS and TF-IDF features, plus data augmentation to address class imbalance. The CNN trained on the balanced, augmented dataset achieves the best F1 score, about 90%, surpassing traditional models (e.g., SGD) and the other DL approaches, and a web application prototype demonstrates practical deployment. The work provides a resource for public health analytics and real-time monitoring. Future directions include larger-scale labeling, transfer learning, and cross-platform generalization to improve robustness and applicability.

Abstract

The COVID-19 pandemic has had adverse effects on both physical and mental health. During the pandemic, numerous studies have sought insights into health-related perspectives from social media. In this study, our primary objective is to develop a machine learning-based web application for automatically classifying COVID-19-related discussions on social media. To achieve this, we label COVID-19-related Twitter data, provide benchmark classification results, and develop a web application. We collected data using the Twitter API and labeled a total of 6,667 tweets into five classes: health risks, prevention, symptoms, transmission, and treatment. We extracted features using several feature extraction methods and applied them to seven traditional machine learning algorithms: Decision Tree, Random Forest, Stochastic Gradient Descent, AdaBoost, K-Nearest Neighbour, Logistic Regression, and Linear SVC. Additionally, we used four deep learning algorithms for classification: LSTM, CNN, RNN, and BERT. Overall, the CNN achieved the highest F1 score, 90.43%. Among the traditional machine learning approaches, Linear SVC achieved the highest F1 score, 86.13%. Our study not only contributes to the field of health-related data analysis but also provides a web-based tool for efficient data classification, which can aid in addressing public health challenges and increasing awareness during pandemics. The dataset and application are publicly available at https://github.com/Bishal16/COVID19-Health-Related-Data-Classification-Website.
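
The F1 scores reported in the abstract summarize precision and recall across the five classes. As a rough illustration of how such a score can be computed for a five-class problem, the sketch below derives per-class F1 from true and predicted labels and averages the results. The abstract does not state which averaging scheme was used, so the unweighted (macro) average here is an assumption, and the function names `per_class_f1` and `macro_f1` as well as the toy labels are hypothetical.

```python
from collections import Counter

# The five class labels named in the abstract.
CLASSES = ["health risks", "prevention", "symptoms", "transmission", "treatment"]


def per_class_f1(y_true, y_pred, classes=CLASSES):
    """Return {class: F1} computed from true and predicted label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    scores = {}
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); set to 0.0 when the class never occurs.
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)


# Hypothetical toy labels, not taken from the COVIDHealth dataset.
y_true = ["symptoms", "symptoms", "prevention", "treatment", "transmission", "health risks"]
y_pred = ["symptoms", "prevention", "prevention", "treatment", "transmission", "health risks"]

print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

In this toy case one "symptoms" tweet is misclassified as "prevention", which lowers the F1 of both classes while the other three stay at 1.0. A weighted average would instead scale each class by its support, which matters when the classes are imbalanced.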

Paper Structure (23 sections, 4 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Workflow of our proposed methodology
  • Figure 2: Word cloud representation of the Twitter dataset
  • Figure 3: Word cloud representation of the five classes of the COVIDHealth dataset
  • Figure 4: Confusion matrix for Linear SVC algorithm
  • Figure 5: Confusion matrix for CNN model
  • ...and 1 more figure