DetectBERT: Towards Full App-Level Representation Learning to Detect Android Malware

Tiezhu Sun, Nadia Daoudi, Kisub Kim, Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein

TL;DR

DetectBERT tackles app-level Android malware detection by converting class-level DexBERT embeddings into a unified representation using correlated MIL. It augments each APK with a learnable category vector and employs Nyström Attention to capture inter-class dependencies before an MLP predicts malware likelihood, all while keeping DexBERT frozen to preserve pre-trained representations. On a large DexRay-based dataset, DetectBERT outperforms basic aggregation methods and state-of-the-art detectors, and it demonstrates strong temporal robustness to evolving threats. This work broadens the use of MIL for scalable, real-world app analysis and suggests that MIL-based frameworks can enhance other software-engineering tasks beyond malware detection.

Abstract

Recent advancements in ML and DL have significantly improved Android malware detection, yet many methodologies still rely on basic static analysis, bytecode, or function call graphs that often fail to capture complex malicious behaviors. DexBERT, a pre-trained BERT-like model tailored for Android representation learning, enriches class-level representations by analyzing Smali code extracted from APKs. However, its functionality is constrained by its inability to process multiple Smali classes simultaneously. This paper introduces DetectBERT, which integrates correlated Multiple Instance Learning (c-MIL) with DexBERT to handle the high dimensionality and variability of Android malware, enabling effective app-level detection. By treating class-level features as instances within MIL bags, DetectBERT aggregates these into a comprehensive app-level representation. Our evaluation demonstrates that DetectBERT not only surpasses existing state-of-the-art detection methods but also adapts to evolving malware threats. Moreover, the versatility of the DetectBERT framework holds promising potential for broader applications in app-level analysis and other software engineering tasks, offering new avenues for research and development.
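To make the bag-of-instances framing concrete, here is a minimal sketch of what a basic, attention-free aggregation could look like. It is an illustration only: the names `mean_pool` and `max_pool`, the choice of mean and max as the pooling operations, and the toy 3-dimensional vectors are assumptions made for this example and are not taken from the paper.

```python
from statistics import fmean


def mean_pool(bag):
    """Element-wise mean over all instance vectors in a bag."""
    return [fmean(column) for column in zip(*bag)]


def max_pool(bag):
    """Element-wise maximum over all instance vectors in a bag."""
    return [max(column) for column in zip(*bag)]


# One bag per APK: each row stands in for one Smali class embedding.
# The values are made up; real class embeddings have many more dimensions.
apk_bag = [
    [0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0],
    [4.0, 1.0, 4.0],
]

app_mean = mean_pool(apk_bag)  # [2.0, 1.0, 2.0]
app_max = max_pool(apk_bag)    # [4.0, 1.0, 4.0]
```

Both operations give one fixed-size vector per app regardless of how many classes the APK contains, and neither depends on instance order. They also treat every class independently, which is the limitation that a correlated MIL aggregator is meant to address.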

Paper Structure (18 sections, 2 theorems, 7 equations, 1 figure, 3 tables)

Key Result

Theorem 1

Suppose $S : \chi \rightarrow \mathbb{R}$ is a continuous set function w.r.t. the Hausdorff distance $d_{H}(\cdot, \cdot)$ [rote1991computing]. Then $\forall \varepsilon > 0$, for any invertible map $P : \chi \rightarrow \mathbb{R}^{n}$, there exist functions $\sigma$ and $g$ such that, for any set $X \in \chi$:

Figures (1)

  • Figure 1: Overview of DetectBERT Workflow. First, DexBERT produces Smali class embeddings as c-MIL instances. A category vector of the same size is then introduced as an additional instance. The Nyström Attention layer helps DetectBERT find correlations among instances, allowing the category vector to capture key information from class embeddings for malware detection. Lastly, this vector is processed in a fully connected layer to make the detection decision.
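The workflow in Figure 1 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses exact softmax attention in place of the Nyström approximation, applies no learned query/key/value projections, computes only the attention row belonging to the category vector, and replaces the final layer with a single sigmoid unit. All function names and numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_with_category_vector(category_vector, class_embeddings):
    """Update the category vector by attending over every instance.

    The category vector is prepended to the bag as an extra instance, then
    used as the query; the raw vectors serve as keys and values (assumption:
    no learned projections, exact rather than Nystrom attention).
    """
    instances = [category_vector] + class_embeddings
    scale = math.sqrt(len(category_vector))
    weights = softmax([dot(category_vector, inst) / scale for inst in instances])
    updated = [
        sum(w * inst[d] for w, inst in zip(weights, instances))
        for d in range(len(category_vector))
    ]
    return updated, weights


def malware_probability(vector, layer_weights, bias):
    """Single fully connected unit followed by a sigmoid (hypothetical head)."""
    return 1.0 / (1.0 + math.exp(-(dot(vector, layer_weights) + bias)))


# Toy bag: two made-up 2-dimensional class embeddings and a zero category vector.
app_vector, attention = aggregate_with_category_vector(
    [0.0, 0.0], [[1.0, 3.0], [2.0, 0.0]]
)
score = malware_probability(app_vector, [0.0, 0.0], 0.0)
```

In a trained model the category vector and the classifier weights would be learned, so the attention weights would concentrate on the class embeddings most indicative of malicious behavior rather than being uniform as in this toy call.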

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2