DetectBERT: Towards Full App-Level Representation Learning to Detect Android Malware

Tiezhu Sun; Nadia Daoudi; Kisub Kim; Kevin Allix; Tegawendé F. Bissyandé; Jacques Klein

TL;DR

DetectBERT performs app-level Android malware detection by aggregating class-level DexBERT embeddings into a single app representation with correlated Multiple Instance Learning (c-MIL). It adds a learnable category vector to each APK's set of class embeddings and applies Nyström Attention to capture dependencies among classes; an MLP then predicts the malware likelihood from the category vector. DexBERT stays frozen throughout, which preserves its pre-trained representations. On a large DexRay-based dataset, DetectBERT outperforms basic aggregation methods and state-of-the-art detectors, and it remains robust over time as threats evolve. The work extends MIL to scalable, real-world app analysis and suggests that MIL-based frameworks could benefit software-engineering tasks beyond malware detection.

Abstract

Recent advancements in machine learning (ML) and deep learning (DL) have significantly improved Android malware detection, yet many methods still rely on basic static analysis, bytecode, or function call graphs that often fail to capture complex malicious behaviors. DexBERT, a pre-trained BERT-like model tailored for Android representation learning, enriches class-level representations by analyzing Smali code extracted from APKs. However, it cannot process multiple Smali classes simultaneously, which limits its use at the app level. This paper introduces DetectBERT, which integrates correlated Multiple Instance Learning (c-MIL) with DexBERT to handle the high dimensionality and variability of Android malware, enabling effective app-level detection. By treating class-level features as instances within MIL bags, DetectBERT aggregates them into a comprehensive app-level representation. Our evaluation demonstrates that DetectBERT not only surpasses existing state-of-the-art detection methods but also adapts to evolving malware threats. Moreover, the versatility of the DetectBERT framework holds promising potential for broader applications in app-level analysis and other software engineering tasks, offering new avenues for research and development.
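To make the "instances within MIL bags" idea concrete, here is a minimal sketch of a bag and of two simple aggregation baselines. The `Bag` type, the `mean_pool` and `max_pool` helpers, and the toy numbers are assumptions made for illustration; mean and max pooling are chosen as generic examples of basic aggregation and are not necessarily the baselines evaluated in the paper.

```python
from typing import List

# Hypothetical representation: one APK is a bag, i.e. a list of
# class-level embedding vectors (one vector per Smali class).
Bag = List[List[float]]


def mean_pool(bag: Bag) -> List[float]:
    """Average the instance embeddings dimension by dimension."""
    n = len(bag)
    return [sum(column) / n for column in zip(*bag)]


def max_pool(bag: Bag) -> List[float]:
    """Keep the largest value in each dimension across instances."""
    return [max(column) for column in zip(*bag)]


# Toy bag: three Smali classes with 4-dimensional embeddings.
# Real class embeddings are far larger; these numbers are made up.
toy_bag: Bag = [
    [0.0, 1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
    [4.0, 1.0, 1.0, 2.0],
]

app_mean = mean_pool(toy_bag)  # [2.0, 1.0, 1.0, 2.0]
app_max = max_pool(toy_bag)    # [4.0, 1.0, 2.0, 3.0]
```

Both baselines treat every class independently and ignore how classes relate to one another, which is the gap that correlated MIL is meant to address.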

Paper Structure (18 sections, 2 theorems, 7 equations, 1 figure, 3 tables)

Introduction
Background
Android Malware Detection
DexBERT
Multiple Instance Learning
Approach
Theoretical Foundations
DetectBERT
Study Design
Research Questions
Dataset
Empirical Setup
Evaluation Metrics
Experimental Results
RQ1: How does DetectBERT perform compared to basic feature aggregation methods in detecting Android malware?
...and 3 more sections

Key Result

theorem 1

Suppose $S : \chi \rightarrow \mathbb{R}$ is a continuous set function w.r.t. the Hausdorff distance $d_{H}(\cdot, \cdot)$ (Rote, 1991). $\forall \varepsilon > 0$, for any invertible map $P : \chi \rightarrow \mathbb{R}^{n}$, there exist functions $\sigma$ and $g$ such that for any set $X \in \chi$:

Figures (1)

Figure 1: Overview of DetectBERT Workflow. First, DexBERT produces Smali class embeddings as c-MIL instances. A category vector of the same size is then introduced as an additional instance. The Nyström Attention layer helps DetectBERT find correlations among instances, allowing the category vector to capture key information from class embeddings for malware detection. Lastly, this vector is processed in a fully connected layer to make the detection decision.
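The workflow in Figure 1 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses plain softmax dot-product attention instead of Nyström Attention, omits the learned query/key/value projections, and stands in a single linear layer with a sigmoid for the final classifier. The names `attend`, `predict`, `w_out`, and `b_out`, and all numeric values, are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, instances: Matrix) -> Vector:
    """Scaled dot-product attention of one query over all instances."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, inst) / scale for inst in instances])
    return [
        sum(w * inst[d] for w, inst in zip(weights, instances))
        for d in range(len(query))
    ]


def predict(class_embeddings: Matrix, category: Vector,
            w_out: Vector, b_out: float) -> float:
    # Step 1: add the category vector as one extra instance in the bag.
    instances = [category] + class_embeddings
    # Step 2: the category vector attends over every instance, so its
    # output mixes information from all class embeddings.
    app_vector = attend(category, instances)
    # Step 3: a linear layer plus sigmoid turns it into a malware score.
    logit = dot(w_out, app_vector) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage with random, untrained values (assumed sizes: 5 classes, dim 8).
random.seed(0)
dim = 8
toy_classes = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
category = [random.uniform(-1, 1) for _ in range(dim)]
w_out = [random.uniform(-1, 1) for _ in range(dim)]
score = predict(toy_classes, category, w_out, 0.0)  # value in (0, 1)
```

In a trained model the category vector and the output weights would be learned while the class embeddings come from the frozen DexBERT; here they are random, so the score carries no meaning beyond showing the data flow.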

Theorems & Definitions (2)

theorem 1
theorem 2
