A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19

Vedant Khandelwal, Manas Gaur, Ugur Kursuncu, Valerie Shalin, Amit Sheth

TL;DR

This paper introduces a neurosymbolic method that integrates neural networks with symbolic knowledge sources to improve the detection and interpretation of mental health-related tweets relevant to COVID-19. The method outperforms purely data-driven models, with an F1 score exceeding 92%.

Abstract

Monitoring public sentiment via social media is potentially helpful during health crises such as the COVID-19 pandemic. However, traditional frequency-based, data-driven neural network approaches can miss newly relevant content because language keeps evolving in a dynamic environment. Human-curated symbolic knowledge sources, such as lexicons for standard language and slang terms, can potentially elevate social media signals in evolving language. We introduce a neurosymbolic method that integrates neural networks with symbolic knowledge sources, enhancing the detection and interpretation of mental health-related tweets relevant to COVID-19. Our method was evaluated using a corpus of large datasets (approximately 12 billion tweets, 2.5 million subreddit data, and 700k news articles) and multiple knowledge graphs. The method dynamically adapts to evolving language and outperforms purely data-driven models, with an F1 score exceeding 92%. It also showed faster adaptation to new data and lower computational demands than fine-tuning pre-trained large language models (LLMs). This study demonstrates the benefit of neurosymbolic methods in interpreting text in a dynamic environment for tasks such as health surveillance.

Paper Structure

This paper contains 38 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The architecture consists of three main components: B1 (Semantic Gap Management), B2 (Metadata Scoring), and B3 (Adaptive Classifier Training). B1 employs shallow infusion: relevant data is gathered from diverse sources, including social media, news articles, and knowledge bases (KBs), and then processed by identifying new terms, updating lexicons, semantic filtering, location extraction, and key phrase/hashtag extraction. B2 uses semantic mapping and proximity to calculate an index score, which is then used to obtain a classification label. B3 features semi-deep knowledge infusion: classifiers are trained using the extracted semantic data and the labels derived from the metadata scores.
  • Figure 2: Overview of the SEDO framework for tweet classification related to mental health and addiction during COVID-19. The process begins with mapping tweet n-grams to the MHDA lexicon. The domain-specific language model generates initial word embeddings from these tweets. A correlation matrix $Q$ is computed over the tweet word vectors, while a cross-correlation matrix $Z$ is established between tweet word vectors and the MHDA lexicon. Additionally, a correlation matrix $P$ is derived over the MHDA lexicon. The SEDO procedure optimizes the matrices $P$, $Q$, and $Z$, modulating the word embeddings through a learned weight matrix. The modulated word embeddings are then utilized for tweet classification. The MHDA KBx integrates concepts from domain-specific resources, including the PHQ-9 lexicon, SNOMED-CT, ICD-10, MeSH terms, and the DAO, enriching the analysis with domain-specific vocabulary and relationships.
  • Figure 3: New terms from Twitter data
  • Figure 4: New terms from Reddit data
  • Figure 5: New terms from News articles
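
The three matrices named in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the paper's SEDO procedure: the caption does not specify how $P$, $Q$, and $Z$ are computed or optimized, so the sketch assumes cosine similarity for all three and substitutes a ridge-regularized least-squares fit as a stand-in for the learned weight matrix. The function names, the `ridge` parameter, and the random embeddings are hypothetical and serve only to show the shapes involved.

```python
import numpy as np

def _unit_rows(m):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)

def correlation_matrices(tweet_vecs, lexicon_vecs):
    """Return (P, Q, Z) as cosine-similarity matrices (an assumption, not SEDO's definition)."""
    t = _unit_rows(tweet_vecs)
    c = _unit_rows(lexicon_vecs)
    Q = t @ t.T  # tweet words x tweet words
    P = c @ c.T  # lexicon concepts x lexicon concepts
    Z = t @ c.T  # tweet words x lexicon concepts
    return P, Q, Z

def fit_weight_matrix(tweet_vecs, lexicon_vecs, Z, ridge=1e-2):
    """Hypothetical stand-in for the learned weight matrix.

    Each tweet word gets a target vector: a softmax(Z)-weighted mix of lexicon
    vectors. W is the ridge least-squares map from tweet vectors to targets.
    """
    shifted = Z - Z.max(axis=1, keepdims=True)  # stabilize the softmax
    attn = np.exp(shifted)
    attn /= attn.sum(axis=1, keepdims=True)
    targets = attn @ lexicon_vecs
    d = tweet_vecs.shape[1]
    gram = tweet_vecs.T @ tweet_vecs + ridge * np.eye(d)
    return np.linalg.solve(gram, tweet_vecs.T @ targets)

# Toy data: 6 tweet word vectors and 4 lexicon concept vectors, dimension 8.
rng = np.random.default_rng(0)
tweet_vecs = rng.normal(size=(6, 8))
lexicon_vecs = rng.normal(size=(4, 8))

P, Q, Z = correlation_matrices(tweet_vecs, lexicon_vecs)
W = fit_weight_matrix(tweet_vecs, lexicon_vecs, Z)
modulated = tweet_vecs @ W  # modulated embeddings that would feed a classifier
```

In this sketch, `modulated` plays the role of the modulated word embeddings passed to the tweet classifier; a faithful implementation would replace `fit_weight_matrix` with the optimization described in the paper body.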