BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline
Alexander Shevtsov, Despoina Antonakaki, Ioannis Lamprou, Polyvios Pratikakis, Sotiris Ioannidis
TL;DR
The paper tackles Twitter bot detection by introducing a semi-automatic machine learning pipeline (SAMLP) and using it to build BotArtist, a detector based on user profile features. The pipeline balances feature selection, hyperparameter tuning, and model explainability (via SHAP) across nine public datasets. Compared against 35 existing methods, BotArtist improves F1-score by almost 10%, achieving average scores of 83.19% and 68.5% over specific and general approaches, respectively. A key contribution is the release of one of the largest labeled Twitter bot datasets: 10,929,533 profiles with BotArtist predictions, linked to 127,275,386 tweets, intended to support future research and generalization beyond topic- or language-specific data. The work emphasizes a lightweight, scalable approach with limited API dependency, while acknowledging potential evasion strategies and advocating further exploration of longer-term data and large language models for bot detection.
Abstract
Twitter, as one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, resulting in the spread of false information and manipulation. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges associated with machine learning model development. Through this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we evaluate 35 existing Twitter bot detection methods, each utilizing a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across nine public datasets under standardized conditions, reveals that the proposed model outperforms existing solutions by almost 10% in terms of F1-score, achieving average scores of 83.19% and 68.5% over specific and general approaches, respectively. As a result of this research, we provide one of the largest labeled Twitter bot datasets. The dataset contains extracted features combined with BotArtist predictions for 10,929,533 Twitter user profiles, collected via the Twitter API during the 2022 Russo-Ukrainian War over a 16-month period. This dataset builds on [Shevtsov et al., 2022a], in which the original authors share anonymized tweets discussing the Russo-Ukrainian War, totaling 127,275,386 tweets. The combination of the existing textual dataset and the provided labeled bot and human profiles will enable future development of more advanced bot detection large language models in the post-Twitter API era.
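For readers unfamiliar with the metric behind the reported averages, the following is a minimal sketch of how a per-dataset F1-score can be computed and then averaged across datasets. It rests on assumptions not stated in the abstract: that the bot class is treated as the positive class, and that the average is an unweighted mean over datasets. The function names, dataset names, and toy labels are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def f1_score(y_true: List[int], y_pred: List[int], positive: int = 1) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_f1(results: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Unweighted mean of per-dataset F1 (an assumed averaging scheme)."""
    if not results:
        raise ValueError("no datasets given")
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in results.values()]
    return sum(scores) / len(scores)


# Toy labels only (1 = bot, 0 = human); dataset names are placeholders.
toy_results = {
    "dataset_a": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "dataset_b": ([1, 0, 1, 0], [1, 1, 1, 0]),
}

if __name__ == "__main__":
    for name, (y_true, y_pred) in toy_results.items():
        print(f"{name}: F1 = {f1_score(y_true, y_pred):.4f}")
    print(f"average F1 = {average_f1(toy_results):.4f}")
```

An unweighted mean gives every dataset equal influence regardless of its size; a size-weighted mean would be another reasonable choice, and the two can differ noticeably when dataset sizes vary widely.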
