ActDroid: An active learning framework for Android malware detection

Ali Muzaffar, Hani Ragab Hassen, Hind Zantout, Michael A Lones

TL;DR

This paper reframes Android malware detection as a streaming problem and investigates how labeling delays (the lag between an application's release and the availability of its label) and concept drift degrade online learning performance. It introduces an active-learning framework that trains only on low-confidence samples and retrains when drift is detected, achieving up to 96% accuracy while using as little as 24–34% of the labels. A comparison of five online learners across static, dynamic, and hybrid feature sets shows that static API-call features yield high accuracy at the cost of high dimensionality, while permissions and opcodes offer robust, cost-effective alternatives; dynamic features can improve performance when combined with static ones but are expensive to extract. The work shows that active learning mitigates labeling delays, offers practical deployment guidance for Android malware detection in label-scarce environments, and highlights trade-offs among feature types, model choice, and labeling costs.

Abstract

The growing popularity of Android requires malware detection systems that can keep up with the pace of new software being released. According to a recent study, a new piece of malware appears online every 12 seconds. To address this, we treat Android malware detection as a streaming data problem and explore the use of active online learning as a means of mitigating the problem of labelling applications in a timely and cost-effective manner. Our resulting framework achieves accuracies of up to 96%, requires as little as 24% of the training data to be labelled, and compensates for concept drift that occurs between the release and labelling of an application. We also consider the broader practicalities of online learning within Android malware detection, and systematically explore the trade-offs between using different static, dynamic and hybrid feature sets to classify malware.
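
The idea of requesting labels only for low-confidence samples can be illustrated with a minimal sketch. This is not the authors' implementation: the learner (a tiny online logistic regression), the confidence threshold of 0.8, the warm-up length, and the synthetic two-feature stream are all assumptions chosen for illustration, and drift handling is deliberately left out. Accuracy is measured in a test-then-train fashion, so every sample is predicted before the model has a chance to learn from it.

```python
import math
import random


class OnlineLogReg:
    """Tiny online logistic regression trained by SGD (a stand-in learner)."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        err = self.predict_proba(x) - y  # gradient of log loss w.r.t. z
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def run_active_stream(stream, n_features, threshold=0.8, warmup=20):
    """Test-then-train loop that asks for a label only when unsure.

    `threshold` and `warmup` are hypothetical parameters, not values
    taken from the paper. Returns (accuracy, fraction of labels requested).
    """
    model = OnlineLogReg(n_features)
    correct = queried = total = 0
    for i, (x, y) in enumerate(stream):
        p = model.predict_proba(x)
        pred = 1 if p >= 0.5 else 0
        # The true label is used here for evaluation only.
        correct += int(pred == y)
        total += 1
        confidence = max(p, 1.0 - p)
        # Request the label (and train) only during warm-up or when the
        # model's confidence falls below the threshold.
        if i < warmup or confidence < threshold:
            model.learn_one(x, y)
            queried += 1
    return correct / total, queried / total


def make_stream(n, seed=0):
    """Synthetic two-feature stream: label 1 clusters near (1, 0), label 0 near (0, 1)."""
    rng = random.Random(seed)
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [y + rng.gauss(0.0, 0.2), (1 - y) + rng.gauss(0.0, 0.2)]
        yield x, y


if __name__ == "__main__":
    acc, frac = run_active_stream(make_stream(500), n_features=2)
    print(f"accuracy={acc:.3f}, labels requested={frac:.1%}")
```

In this sketch, raising the threshold requests more labels and lowering it requests fewer, which is the same kind of accuracy-versus-labelling-cost trade-off the abstract describes. A fuller version would also add a drift detector that triggers retraining.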

Paper Structure (27 sections, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Number of benign and malware applications in the dataset per day
  • Figure 2: Our active learning framework
  • Figure 3: Progressive validation results on OL models trained on static features
  • Figure 4: Progressive validation results on OL models trained on dynamic features
  • Figure 5: Progressive validation results on OL models trained on hybrid features
  • ...and 9 more figures