BERTDetect: A Neural Topic Modelling Approach for Android Malware Detection
Nishavi Ranaweera, Jiarui Xu, Suranga Seneviratne, Aruna Seneviratne
TL;DR
Android malware detection benefits from complementary signals beyond static/dynamic code analysis. BERTDetect applies BERTopic neural topic modelling to Google Play app descriptions to form coherent, function-based topic clusters, then trains a One-Class SVM per topic on binary API-call usage to detect outliers. The approach yields more coherent topics and a $10\%$ relative improvement in F1 over baselines such as LDA, CHABADA, and G-CATA on AndroCatSet, with a reported F1 of $0.54$ and TP rate of $50.89\%$. This metadata-driven method offers a lightweight, scalable complement to traditional analysis, improving detection of unseen malware while maintaining practical accuracy.
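The second stage described above (one outlier detector per topic, fitted on binary API-call vectors) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn's `OneClassSVM`, takes the topic assignments as given (in practice they might come from something like BERTopic's `fit_transform` on the app descriptions), and uses random toy data. The function names, the `nu` value, and the kernel settings are all hypothetical choices.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def fit_topic_detectors(api_usage, topic_ids, nu=0.1):
    """Fit one One-Class SVM per topic on binary API-usage rows.

    nu=0.1 and the RBF kernel are illustrative assumptions, not values
    taken from the paper.
    """
    detectors = {}
    for topic in np.unique(topic_ids):
        rows = api_usage[topic_ids == topic]
        model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
        detectors[int(topic)] = model.fit(rows)
    return detectors


def flag_outliers(detectors, api_usage, topic_ids):
    """Return a boolean array, True where an app is an outlier in its topic."""
    flags = np.zeros(len(topic_ids), dtype=bool)
    for topic, detector in detectors.items():
        mask = topic_ids == topic
        # OneClassSVM.predict returns -1 for outliers and +1 for inliers.
        flags[mask] = detector.predict(api_usage[mask]) == -1
    return flags


# Toy stand-in data: 40 apps, 6 hypothetical API features, 2 topics.
rng = np.random.default_rng(0)
api_usage = rng.integers(0, 2, size=(40, 6))  # 1 = app calls the API
topic_ids = np.repeat([0, 1], 20)             # placeholder topic labels

detectors = fit_topic_detectors(api_usage, topic_ids)
flags = flag_outliers(detectors, api_usage, topic_ids)
```

Fitting a separate model per topic means an app is judged only against apps that advertise similar functionality, which is the intuition behind treating API-usage outliers within a cluster as suspicious.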
Abstract
Web access today occurs predominantly through mobile devices, with Android representing a significant share of the mobile device market. This widespread usage makes Android a prime target for malicious attacks. Despite efforts to combat such attacks through tools like Google Play Protect and antivirus software, new and evolved malware continues to infiltrate Android devices. Source code analysis is effective but limited, as attackers quickly abandon old malware for new variants to evade detection. Therefore, there is a need for alternative methods that complement source code analysis. Prior research clustered applications based on their descriptions and flagged apps whose API usage made them outliers within their cluster as malware. However, these works often used traditional techniques such as Latent Dirichlet Allocation (LDA) and k-means clustering, which do not capture the nuanced semantic structures present in app descriptions. To this end, we propose BERTDetect, which leverages BERTopic neural topic modelling to effectively capture the latent topics in app descriptions. The resulting topic clusters are more coherent than those produced by previous methods and represent the app functionalities well. Our results demonstrate that BERTDetect outperforms other baselines, achieving ~10% relative improvement in F1 score.
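Because the improvement is stated in relative rather than absolute terms, a short back-of-the-envelope calculation may help. The sketch below assumes the F1 of 0.54 quoted in the TL;DR is the score the ~10% figure refers to; the implied baseline it prints is an inference from those two numbers, not a value reported in this text.

```python
def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score (0.10 means 10%)."""
    return (new_score - baseline_score) / baseline_score


def implied_baseline(new_score, relative_gain):
    """Baseline score that would yield relative_gain for new_score."""
    return new_score / (1.0 + relative_gain)


# Assumption: F1 = 0.54 and a 10% relative gain describe the same comparison.
baseline_f1 = implied_baseline(0.54, 0.10)
print(f"implied baseline F1: {baseline_f1:.3f}")
print(f"relative gain: {relative_improvement(0.54, baseline_f1):.2%}")
```

Under that assumption the best baseline would sit near an F1 of 0.49, so the gain is roughly 0.05 in absolute terms rather than 0.10.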
