DetectBERT: Towards Full App-Level Representation Learning to Detect Android Malware
Tiezhu Sun, Nadia Daoudi, Kisub Kim, Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein
TL;DR
DetectBERT tackles app-level Android malware detection by aggregating class-level DexBERT embeddings into a single app representation with correlated Multiple Instance Learning (c-MIL). It augments each APK's set of class embeddings with a learnable category vector and applies Nyström Attention to capture dependencies among the app's Smali classes; an MLP then predicts the malware likelihood. DexBERT stays frozen throughout to preserve its pre-trained representations. On a large DexRay-based dataset, DetectBERT outperforms basic aggregation methods and state-of-the-art detectors, and it demonstrates strong temporal robustness to evolving threats. The work broadens the use of MIL for scalable, real-world app analysis and suggests that MIL-based frameworks could benefit software-engineering tasks beyond malware detection.
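The data flow described above (class embeddings plus a category vector, attention across them, then a prediction head) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: plain softmax self-attention with identity projections stands in for Nyström Attention, a single linear layer with a sigmoid stands in for the MLP, the category vector is placed at position 0, and all names (`self_attention`, `app_level_score`, `head_weights`) are hypothetical. The random vectors merely play the role of frozen DexBERT class embeddings.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Assumption: this exact O(n^2) attention is a stand-in for the
    Nystrom approximation named in the text.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out


def app_level_score(class_embeddings, category_vector, head_weights, head_bias):
    """Return a malware probability for one app (one MIL bag)."""
    # Assumption: the category vector sits at position 0 of the sequence.
    tokens = [category_vector] + class_embeddings
    attended = self_attention(tokens)
    # The attended category vector serves as the app-level representation.
    app_repr = attended[0]
    # Assumption: a linear layer plus sigmoid replaces the MLP head.
    logit = dot(head_weights, app_repr) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))


random.seed(0)
dim = 8
# Five random vectors standing in for the class embeddings of one APK.
bag = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
category = [random.gauss(0, 1) for _ in range(dim)]
head = [random.gauss(0, 1) for _ in range(dim)]
print(f"malware probability: {app_level_score(bag, category, head, 0.0):.3f}")
```

Because the sketch uses no positional information, the score does not depend on the order in which an app's classes are listed, which matches the view of an APK as an unordered bag of class embeddings. In a trained model the category vector and the head would be learned while the embeddings stay fixed.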
Abstract
Recent advancements in machine learning (ML) and deep learning (DL) have significantly improved Android malware detection, yet many methodologies still rely on basic static analysis, bytecode, or function call graphs that often fail to capture complex malicious behaviors. DexBERT, a pre-trained BERT-like model tailored for Android representation learning, enriches class-level representations by analyzing Smali code extracted from APKs. However, its functionality is constrained by its inability to process multiple Smali classes simultaneously. This paper introduces DetectBERT, which integrates correlated Multiple Instance Learning (c-MIL) with DexBERT to handle the high dimensionality and variability of Android malware, enabling effective app-level detection. By treating class-level features as instances within MIL bags, DetectBERT aggregates them into a comprehensive app-level representation. Our evaluation demonstrates that DetectBERT not only surpasses existing state-of-the-art detection methods but also adapts to evolving malware threats. Moreover, the versatility of the DetectBERT framework holds promising potential for broader applications in app-level analysis and other software engineering tasks, offering new avenues for research and development.
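The MIL framing in the abstract can be written compactly. The notation below is an assumption introduced for illustration (the symbols $B$, $x_i$, $\rho$, $g$, and $z$ do not come from the abstract): an app is a bag of class embeddings, only the app-level label is observed during training, and the difference between a basic aggregator and a correlated one lies in whether instances are pooled independently or allowed to interact before pooling.

```latex
\begin{align*}
B &= \{x_1, \dots, x_n\}, \quad x_i \in \mathbb{R}^d
  && \text{(bag: one embedding per Smali class)} \\
z_{\text{mean}} &= \frac{1}{n} \sum_{i=1}^{n} x_i
  && \text{(basic aggregation: instances pooled independently)} \\
z_{\text{corr}} &= \rho(x_1, \dots, x_n)
  && \text{(correlated aggregation: instances may interact)} \\
\hat{y} &= g(z) \in [0, 1]
  && \text{(app-level malware score)}
\end{align*}
```

Here $\rho$ would be any permutation-invariant function that lets instances attend to one another, such as an attention layer, and $g$ a classifier head. The number of classes $n$ varies from app to app, which is why a bag-level aggregator is needed in the first place.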
