Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis

Saydul Akbar Murad, Ashim Dahal, Nick Rahimi

TL;DR

The paper tackles multilingual cyber threat detection in tweets across English, Chinese, Russian, and Arabic, addressing the scarcity of cross-language evaluations. It systematically compares ML, DL, and LLM approaches on language-specific datasets and a combined multilingual dataset, using manual and polarity-based labeling. Random Forest excels among ML methods, while Bi-LSTM generally provides the best performance among DL models; on the integrated multilingual data, Bi-LSTM still outperforms the LLM (XLM-RoBERTa), highlighting the strength of sequential context modeling for multilingual threat classification. The authors release a multilingual dataset and demonstrate a robust framework that informs future work on multilingual cyber threat detection and fine-tuning of advanced LLMs. The work has practical significance for safer social platforms by enabling scalable, language-inclusive threat identification.

Abstract

Cyber threat detection has become an important area of focus in today's digital age due to the growing spread of fake information and harmful content on social media platforms such as Twitter (now 'X'). These cyber threats, often disguised within tweets, pose significant risks to individuals, communities, and even nations, emphasizing the need for effective detection systems. While previous research has explored tweet-based threats, much of the work is limited to specific languages, domains, or locations, or relies on single-model approaches, which reduces its applicability to diverse real-world scenarios. To address these gaps, our study focuses on multi-lingual tweet cyber threat detection using a variety of advanced models. The research was conducted in three stages: (1) We collected and labeled tweet datasets in four languages (English, Chinese, Russian, and Arabic), employing both manual and polarity-based labeling methods to ensure high-quality annotations. (2) Each dataset was analyzed individually using machine learning (ML) and deep learning (DL) models to assess their performance on distinct languages. (3) Finally, we combined all four datasets into a single multi-lingual dataset and applied DL and large language model (LLM) architectures to evaluate their efficacy in identifying cyber threats across various languages. Our results show that among machine learning models, Random Forest (RF) attained the highest performance; however, the Bi-LSTM architecture consistently surpassed other DL and LLM architectures across all datasets. These findings underline the effectiveness of Bi-LSTM in multilingual cyber threat detection. The code for this paper can be found at this link: https://github.com/Mmurrad/Tweet-Data-Classification.git.
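The abstract does not spell out how polarity-based labeling works. One common reading is that each tweet receives a sentiment polarity score and strongly negative tweets are marked as potential threats. The sketch below illustrates that idea only; the lexicon `TOY_LEXICON`, the helper names `polarity` and `label_tweet`, and the threshold of -0.3 are all hypothetical and are not taken from the paper or its repository.

```python
import re

# Hypothetical toy lexicon (assumption): word -> polarity in [-1, 1].
# A real pipeline would rely on an established sentiment resource.
TOY_LEXICON = {
    "attack": -0.8,
    "breach": -0.7,
    "malware": -0.9,
    "ransomware": -0.9,
    "safe": 0.6,
    "patched": 0.5,
    "thanks": 0.7,
}


def polarity(text, lexicon=TOY_LEXICON):
    """Mean polarity of the lexicon words found in text (0.0 if none match)."""
    tokens = re.findall(r"\w+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def label_tweet(text, threshold=-0.3):
    """Return 1 (potential threat) if polarity is below threshold, else 0.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return 1 if polarity(text) < threshold else 0


tweets = [
    "New ransomware attack hits hospital network",
    "Thanks, the server is patched and safe now",
    "Lunch was fine",
]
labels = [label_tweet(t) for t in tweets]
print(labels)
```

This tokenizer splits on word characters, so it only suits whitespace-delimited text; Chinese tweets would need a word segmenter before any lexicon lookup. Labels produced this way are noisy, which is presumably why the authors pair polarity-based labeling with manual annotation.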

Paper Structure

This paper contains 24 sections, 42 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Workflow of the Tweet Data Classification Process.
  • Figure 2: Workflow of the Tweet Data Classification Process.