Detecting Android Malware: From Neural Embeddings to Hands-On Validation with BERTroid
Meryam Chaieb, Mostafa Anouar Ghorab, Mohamed Aymen Saied
TL;DR
This work tackles Android malware detection by combining a transformer-based model with permission-focused features. It introduces BERTroid, which fine-tunes BERT on permission strings and adds lightweight classification layers. The model is validated through a manual analysis protocol and cross-dataset testing (Drebin, MalDozer, AndroZoo). The study reports state-of-the-art performance (F1 ~0.999) and resilience to permission evolution, while acknowledging the practical limits of manual analysis and the need for continuous data collection. These results suggest a scalable and accurate framework for proactive Android security, with potential extensions to ensemble methods and family-level malware classification. Together, static permission signals and rigorous validation make the approach a practical contribution to malware defense in mobile ecosystems.
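To make "permission strings" concrete, here is a minimal sketch of how an app's requested permissions might be flattened into a single text sequence before being passed to a BERT tokenizer. The function name `permissions_to_text` and the normalization steps (dropping the package prefix, lowercasing, splitting on underscores, removing duplicates) are illustrative assumptions, not the preprocessing described by the authors.

```python
def permissions_to_text(permissions):
    """Flatten Android permission identifiers into one whitespace-separated string.

    Hypothetical preprocessing: the normalization below is an assumption made
    for illustration, not the pipeline used in the paper.
    """
    tokens = []
    seen = set()
    for perm in permissions:
        # Keep only the final component: "android.permission.SEND_SMS" -> "SEND_SMS".
        name = perm.strip().rsplit(".", 1)[-1]
        # Lowercase and split on underscores so a subword tokenizer sees ordinary words.
        words = name.lower().replace("_", " ")
        # Skip empty entries and duplicates while preserving first-seen order.
        if words and words not in seen:
            seen.add(words)
            tokens.append(words)
    return " ".join(tokens)


sample = [
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",  # duplicate request, dropped
]
print(permissions_to_text(sample))  # send sms read contacts
```

The resulting string could then be tokenized and fed to a fine-tuned BERT encoder with a classification head on top. That step is omitted here because it depends on model and library choices not specified in this summary.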
Abstract
As cyber threats and malware attacks increasingly alarm both individuals and businesses, the urgency for proactive malware countermeasures intensifies. This has driven a rising interest in automated machine learning solutions. Transformers, a cutting-edge category of attention-based deep learning methods, have demonstrated remarkable success. In this paper, we present BERTroid, an innovative malware detection model built on the BERT architecture. We evaluate BERTroid on multiple datasets to assess its performance across diverse scenarios. Overall, BERTroid emerges as a promising solution for combating Android malware: its ability to outperform state-of-the-art solutions demonstrates its potential as a proactive defense mechanism against malicious software attacks. In the dynamic landscape of cybersecurity, our approach has also demonstrated promising resilience against the rapid evolution of malware on Android systems. While the machine learning model captures broad patterns, we emphasize the role of manual validation for deeper insight into malware behaviors. This human intervention is critical for discerning intricate and context-specific behaviors, thereby validating and reinforcing the model's findings.
