A Big Data Analytics System for Predicting Suicidal Ideation in Real-Time Based on Social Media Streaming Data

Mohamed A. Allayla, Serkan Ayvaz

TL;DR

This work addresses the need for real-time suicidal ideation detection from social media by proposing a scalable big-data architecture that combines batch training on Reddit data with real-time streaming from Twitter. The system leverages Apache Spark ML classifiers and multiple feature-extraction techniques (Unigram/Bigram with CV-IDF, TF-IDF) within a two-phase pipeline (batch processing and real-time prediction) orchestrated by Apache Kafka and Spark Structured Streaming, with results visualized in Power BI. The strongest batch results come from the MLP model using Unigram+Bigram with CV-IDF, achieving 93.47% accuracy and 98.12% AUC, and this model is deployed for real-time streaming predictions, where 764 tweets were processed and 9.29% were flagged as suicidal. The approach demonstrates a scalable, end-to-end framework capable of supporting timely public health interventions, with potential extensions to more languages and advanced neural architectures in future work.
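As a rough illustration of the real-time prediction phase, the sketch below scores an incoming stream in micro-batches and keeps running totals of processed and flagged items. It is plain Python and not the authors' Kafka and Spark Structured Streaming code; the helpers `micro_batches` and `score_stream`, the `predict` callable, and the batch size are all hypothetical names and values chosen for illustration.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def score_stream(stream, predict, batch_size=100):
    """Apply a fitted predict(text) -> 0/1 function batch by batch.

    Returns (processed, flagged) totals over the whole stream.
    """
    processed = flagged = 0
    for batch in micro_batches(stream, batch_size):
        labels = [predict(text) for text in batch]
        processed += len(labels)
        flagged += sum(labels)
    return processed, flagged


# Hypothetical stand-in for the trained model: flags messages ending in "0".
messages = [f"message {i}" for i in range(250)]
processed, flagged = score_stream(
    messages, lambda text: int(text.endswith("0")), batch_size=64
)
share = 100.0 * flagged / processed  # percentage of messages flagged
```

The same bookkeeping applies to the reported streaming run: 9.29% of 764 processed tweets corresponds to roughly 71 flagged tweets.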

Abstract

Online social media platforms have recently become integral to our society and daily routines. Every day, users worldwide spend a couple of hours on such platforms, expressing their sentiments and emotional states and contacting each other. Analyzing the huge amounts of data from these platforms can provide clear insight into public sentiment and help detect users' mental states. Early identification of these health risks may help prevent or reduce suicidal ideation and potentially save lives. Traditional techniques have become ineffective at processing such streams and large-scale datasets. Therefore, this paper proposes a new methodology based on a big data architecture to predict suicidal ideation from social media content. The proposed approach analyzes social media data in two phases: batch processing and real-time streaming prediction. The batch dataset was collected from Reddit and used for model building and training, while the streaming data was extracted using the Twitter streaming API and used for real-time prediction. After the raw data was preprocessed, the extracted features were fed to multiple Apache Spark ML classifiers: Naive Bayes (NB), Logistic Regression (LR), LinearSVC, Decision Tree (DT), Random Forest (RF), and Multilayer Perceptron (MLP). We conducted experiments using various feature-extraction techniques under different testing scenarios. The results of the batch processing phase showed that (Unigram + Bigram) + CV-IDF features with the MLP classifier achieved high performance in classifying suicidal ideation, with an accuracy of 93.47%; this model was then applied in the real-time streaming prediction phase.
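To make the (Unigram + Bigram) + CV-IDF feature idea concrete, here is a minimal pure-Python sketch: term counts over unigrams and bigrams, rescaled by a smoothed inverse document frequency. It assumes whitespace tokenization and the smoothed form log((m + 1) / (df + 1)) that the Spark ML documentation gives for its IDF; the function names and the toy documents are hypothetical, and the paper's actual preprocessing and Spark pipeline are not reproduced here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigram_bigram_terms(text):
    """Lowercase, split on whitespace, and emit unigrams followed by bigrams."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


def fit_cv_idf(docs):
    """Build a vocabulary and smoothed IDF weights from a list of raw texts."""
    term_lists = [unigram_bigram_terms(d) for d in docs]
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for terms in term_lists for t in set(terms))
    vocab = {t: i for i, t in enumerate(sorted(doc_freq))}
    m = len(docs)
    # Assumed smoothing: log((m + 1) / (df + 1)).
    idf = {t: math.log((m + 1) / (df + 1)) for t, df in doc_freq.items()}
    return vocab, idf


def transform(text, vocab, idf):
    """Map a text to a sparse {index: count * idf} vector; unseen terms are dropped."""
    counts = Counter(t for t in unigram_bigram_terms(text) if t in vocab)
    return {vocab[t]: c * idf[t] for t, c in counts.items()}


# Toy, neutral documents used only to exercise the functions.
docs = ["the cat sat", "the dog sat", "a bird flew away"]
vocab, idf = fit_cv_idf(docs)
vec = transform("the cat sat", vocab, idf)
```

Terms that appear in many documents (here "the" and "sat") receive smaller weights than terms that appear in one, and adding bigrams lets short phrases act as features of their own. The resulting sparse vectors are the kind of input a downstream classifier such as an MLP would consume.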

Paper Structure (26 sections, 7 equations, 13 figures, 4 tables)


Figures (13)

  • Figure 1: Proposed methodology for predicting suicidal ideation on social media content
  • Figure 2: The primary steps in preprocessing the raw dataset
  • Figure 3: Word cloud representation of suicidal-related postings
  • Figure 4: Word cloud representation of non-suicidal-related postings
  • Figure 5: Comparison of performance results of all classification algorithms with Unigram + TF-IDF features
  • ...and 8 more figures