Multi-label Classification for Android Malware Based on Active Learning

Qijing Qiao; Ruitao Feng; Sen Chen; Fei Zhang; Xiaohong Li

TL;DR

MLCDroid tackles the need for fine-grained Android malware understanding by introducing a $|L|=6$-label multi-label classification framework built on a $531$-feature dictionary. It combines a broad evaluation of $70$ MLC+BC algorithm pairs with a Detection-Training active-learning component that augments unlabeled data via pseudo-labeling, achieving up to $86.7\%$ on DREBIN and $83.3\%$ on VirusShare. The work demonstrates that multi-label, behavior-focused classification can provide richer, actionable insights for security analysts and sets a foundation for scalable, data-efficient malware analysis.

Abstract

Existing malware classification approaches (i.e., binary and family classification) produce outputs that offer little help to subsequent analysis. Family classification in particular suffers from the lack of a formal naming standard and from incomplete definitions of malicious behaviors. More importantly, existing approaches cannot handle a single malware sample that exhibits multiple malicious behaviors, which is very common for Android malware in the wild. As a result, none of them gives researchers a direct and sufficiently comprehensive understanding of malware. In this paper, we propose MLCDroid, an ML-based multi-label classification approach that directly indicates the presence of pre-defined malicious behaviors. Through an in-depth analysis of real-world malware with security reports, we summarize six basic malicious behaviors and construct a labeled dataset. We compare the results of 70 algorithm combinations to evaluate effectiveness (best result: 73.3%). To address the high cost of data annotation, we further propose an active learning approach based on data augmentation, which improves the overall accuracy to 86.7% by adding 5,000+ high-quality samples from an unlabeled malware dataset. This is the first multi-label Android malware classification approach that aims to provide more information on fine-grained malicious behaviors.
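
The augmentation step described in the abstract can be pictured with a minimal sketch. The Python below is not MLCDroid's implementation; it only illustrates confidence-filtered pseudo-labeling for a six-label problem. The thresholds `hi` and `lo`, the function names `pseudo_label` and `augment`, and the rule that every label must be confidently present or confidently absent are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

NUM_LABELS = 6  # the paper defines six basic malicious behaviors


def pseudo_label(probs: Sequence[float],
                 hi: float = 0.9,
                 lo: float = 0.1) -> Optional[List[int]]:
    """Turn per-label probabilities into a 0/1 label vector.

    Returns None unless every label is confidently present (>= hi)
    or confidently absent (<= lo). Thresholds are hypothetical.
    """
    if len(probs) != NUM_LABELS:
        raise ValueError(f"expected {NUM_LABELS} probabilities")
    labels = []
    for p in probs:
        if p >= hi:
            labels.append(1)
        elif p <= lo:
            labels.append(0)
        else:
            return None  # at least one label is uncertain: skip the sample
    return labels


def augment(labeled: List[Tuple[str, List[int]]],
            predictions: Dict[str, Sequence[float]]
            ) -> Tuple[List[Tuple[str, List[int]]], List[str]]:
    """Add confidently pseudo-labeled samples to the labeled set.

    Returns the enlarged labeled set and the ids that stay unlabeled.
    """
    enlarged = list(labeled)
    remaining = []
    for sample_id, probs in predictions.items():
        labels = pseudo_label(probs)
        if labels is None:
            remaining.append(sample_id)
        else:
            enlarged.append((sample_id, labels))
    return enlarged, remaining


if __name__ == "__main__":
    seed = [("app_a", [1, 0, 0, 1, 0, 0])]
    preds = {
        "app_b": [0.95, 0.02, 0.99, 0.05, 0.0, 1.0],  # confident on all labels
        "app_c": [0.95, 0.50, 0.99, 0.05, 0.0, 1.0],  # second label uncertain
    }
    enlarged, remaining = augment(seed, preds)
    print(len(enlarged), remaining)
```

In an iterative setting, the model would be retrained on the enlarged set and the remaining samples scored again; how many rounds to run and when to stop are design choices this sketch leaves open.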

Paper Structure (47 sections, 5 equations, 7 figures, 8 tables)


Introduction
Approach
Behavior Analysis, Feature Selection, and Data Annotation
Attack chain and behavior analysis
Label definition
Feature selection
Data annotation
Base MLC Model Construction
Detection-Training
Evaluation
Used Datasets
Manually labeled malware
DREBIN dataset
VirusShare dataset
Experimental Environment
...and 32 more sections

Figures (7)

Figure 1: An overview of MLCDroid
Figure 2: Behavior analysis and data annotation
Figure 3: Base MLC model construction with labeled dataset
Figure 4: Detection-Training: an active learning framework based on data augmentation
Figure 5: The plots of accuracy improvement and data augmentation on the DREBIN dataset
...and 2 more figures
