BERTDetect: A Neural Topic Modelling Approach for Android Malware Detection

Nishavi Ranaweera, Jiarui Xu, Suranga Seneviratne, Aruna Seneviratne

TL;DR

Android malware detection benefits from complementary signals beyond static and dynamic code analysis. BERTDetect applies BERTopic neural topic modelling to Google Play app descriptions to form coherent, function-based topic clusters, then trains one One-Class SVM per topic on binary API-call usage to detect outliers. The approach yields more coherent topics and a $10\%$ relative improvement in F1 over baselines such as LDA, CHABADA, and G-CATA on AndroCatSet, with a reported F1 of $0.54$ and a TP rate of $50.89\%$. This metadata-driven method offers a lightweight, scalable complement to traditional analysis, improving detection of unseen malware while maintaining practical accuracy.
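The second stage described above (per-topic outlier detection on binary API-call usage) can be illustrated with a minimal sketch. It rests on several assumptions: topic assignments are taken as given (in BERTDetect they would come from BERTopic), and a simple per-topic API-usage profile with a deviation threshold stands in for the One-Class SVM, so this is not the authors' detector. The API feature names, app names, toy vectors, and the threshold of 0.5 are all hypothetical.

```python
from collections import defaultdict

# Hypothetical binary features: 1 if the app calls the API, else 0.
API_FEATURES = ["CAMERA", "LOCATION", "SEND_SMS", "READ_CONTACTS"]


def fit_topic_profiles(benign_apps):
    """Per topic, learn the fraction of benign apps that use each API."""
    by_topic = defaultdict(list)
    for app in benign_apps:
        by_topic[app["topic"]].append(app["apis"])
    profiles = {}
    for topic, vectors in by_topic.items():
        n = len(vectors)
        # zip(*vectors) iterates column-wise, i.e. one API at a time.
        profiles[topic] = [sum(column) / n for column in zip(*vectors)]
    return profiles


def anomaly_score(profile, apis):
    """Mean absolute deviation of an app's API vector from its topic profile."""
    return sum(abs(p - a) for p, a in zip(profile, apis)) / len(profile)


def flag_outliers(profiles, apps, threshold=0.5):
    """Flag apps whose API usage deviates strongly from their own topic."""
    return {
        app["name"]: anomaly_score(profiles[app["topic"]], app["apis"]) > threshold
        for app in apps
    }


# Toy training data: topic 0 ~ photo apps, topic 1 ~ messaging apps (assumed).
benign = [
    {"topic": 0, "apis": [1, 0, 0, 0]},
    {"topic": 0, "apis": [1, 1, 0, 0]},
    {"topic": 0, "apis": [1, 0, 0, 0]},
    {"topic": 1, "apis": [0, 0, 1, 1]},
    {"topic": 1, "apis": [0, 0, 1, 1]},
    {"topic": 1, "apis": [0, 1, 1, 1]},
]
profiles = fit_topic_profiles(benign)

# Both candidates are described as photo apps (topic 0), but candidate_b
# also sends SMS and reads contacts, which is unusual for that topic.
candidates = [
    {"name": "candidate_a", "topic": 0, "apis": [1, 0, 0, 0]},
    {"name": "candidate_b", "topic": 0, "apis": [1, 0, 1, 1]},
]
flags = flag_outliers(profiles, candidates)
```

The point of the design is that the same API vector can be normal in one topic and anomalous in another: SMS and contact access fits the toy messaging topic but not the photo topic. A One-Class SVM plays the same role with a learned, non-linear decision boundary in place of the fixed threshold used here.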

Abstract

Web access today occurs predominantly through mobile devices, with Android representing a significant share of the mobile device market. This widespread usage makes Android a prime target for malicious attacks. Despite efforts to combat these attacks through tools like Google Play Protect and antivirus software, new and evolved malware continues to infiltrate Android devices. Source code analysis is effective but limited, as attackers quickly abandon old malware for new variants to evade detection. Therefore, there is a need for alternative methods that complement source code analysis. Prior research clustered applications based on their descriptions and identified outliers in these clusters, by API usage, as malware. However, these works often used traditional techniques such as Latent Dirichlet Allocation (LDA) and k-means clustering, which do not capture the nuanced semantic structures present in app descriptions. To this end, we propose BERTDetect, which leverages BERTopic neural topic modelling to effectively capture the latent topics in app descriptions. The resulting topic clusters are more coherent than those of previous methods and represent app functionalities well. Our results demonstrate that BERTDetect outperforms other baselines, achieving ~10% relative improvement in F1 score.
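For reference, the metric behind the closing claim can be written out. These are the standard definitions, with $P$ the precision and $R$ the recall; which baseline the improvement is measured against is not stated in this summary, so $F1_{\text{baseline}}$ below is a placeholder rather than a reported value.

$$
F1 = \frac{2PR}{P + R}, \qquad
\text{relative improvement} = \frac{F1_{\text{BERTDetect}} - F1_{\text{baseline}}}{F1_{\text{baseline}}}
$$

A relative improvement of about 10% therefore means BERTDetect's F1 is roughly 1.1 times the baseline's F1, not 10 percentage points higher.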

Paper Structure

This paper contains 22 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: BERTDetect Framework
  • Figure 2: Distribution of the first and second highest affinity values assigned to app descriptions in BERTopic. (a) illustrates that, for most app descriptions, the highest affinity values are close to 1.0, indicating a strong association with a single dominant topic. (b) displays the distribution of the second highest affinity values, which are predominantly close to zero.
  • Figure 3: CCDF of Topic Cluster Quality. The curves demonstrate how BERTopic consistently maintains higher quality topic clusters across different metrics compared to the baselines LDA and CHABADA.
  • Figure 4: Word clouds of the topic assignments of a malicious "Bridal makeup" app.
  • Figure 5: Word clouds of the topic assignments of a malicious "Event reminder" app.
  • ...and 4 more figures