Scalable APT Malware Classification via Parallel Feature Extraction and GPU-Accelerated Learning
Noah Subedar, Taeui Kim, Saathwick Venkataramalingam
TL;DR
The study tackles scalable APT malware classification using opcode sequences extracted from binaries via Ghidra in headless mode, combined with parallel processing and GPU-accelerated learning. It introduces an automated pipeline that builds 1-gram and 2-gram datasets, applies variance-based feature selection, and tests traditional classifiers (SVM, KNN, Decision Tree) alongside a CNN model inspired by "Deep Android Malware Detection". Integrating metadata improves classifier performance: Decision Trees achieve the top results on opcode-plus-metadata configurations, while CNNs excel when operating on opcode sequences alone. The results demonstrate that high-throughput, opcode-based malware attribution is feasible. They also highlight the trade-offs between traditional ML and deep learning approaches, particularly with and without contextual metadata, and show that GPU acceleration enables practical training times on large-scale datasets.
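As a rough illustration of the n-gram and variance-based selection steps mentioned above, the following minimal sketch builds n-gram frequency vectors from opcode lists and drops features whose variance across samples does not exceed a threshold. The function names, the use of relative frequencies rather than raw counts, and the threshold value are all assumptions made for illustration; the paper's actual implementation may differ.

```python
from collections import Counter
from statistics import pvariance


def ngram_counts(opcodes, n):
    """Count contiguous n-grams in one sample's opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def build_matrix(samples, n):
    """Return (vocabulary, rows); each row holds relative n-gram frequencies.

    Normalising by the total n-gram count is an assumption, chosen so that
    long and short binaries are comparable.
    """
    counts = [ngram_counts(sample, n) for sample in samples]
    vocab = sorted(set().union(*counts))
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for tiny samples
        rows.append([c[gram] / total for gram in vocab])
    return vocab, rows


def select_by_variance(vocab, rows, threshold):
    """Keep only the features whose population variance exceeds the threshold."""
    keep = [j for j in range(len(vocab))
            if pvariance([row[j] for row in rows]) > threshold]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in rows]


# Toy opcode sequences standing in for disassembled samples.
samples = [
    ["push", "mov", "call", "ret"],
    ["push", "mov", "mov", "ret"],
    ["push", "xor", "call", "ret"],
]
vocab, rows = build_matrix(samples, n=1)
kept, reduced = select_by_variance(vocab, rows, threshold=0.0)
```

In this toy input, `push` and `ret` occur with the same frequency in every sample, so they carry no discriminating signal and are removed. Calling `build_matrix(samples, n=2)` produces the corresponding 2-gram features.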
Abstract
This paper presents a framework for automating and accelerating malware classification, specifically the mapping of malicious executables to known Advanced Persistent Threat (APT) groups. The main feature used in this analysis is the sequence of assembly-level instructions, known as opcodes, present in each executable. Collecting opcodes across many malicious samples is a lengthy process; hence, open-source reverse engineering tools are used in tandem with scripts that leverage parallel computing to analyze multiple files at once. Traditional machine learning and deep learning methods are then applied to build models capable of classifying malware samples. One-gram and two-gram datasets are constructed and used to train models such as SVM, KNN, and Decision Tree; however, these models struggle to provide adequate results without relying on metadata to support the n-gram sequences. The computational limitations of such models are overcome with convolutional neural networks (CNNs), whose training is heavily accelerated using graphics processing unit (GPU) resources.
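The parallel collection step described above could be sketched as follows, assuming Ghidra's `analyzeHeadless` launcher is the reverse engineering tool in use. The install path, the post-script name `ExtractOpcodes.py`, and the helper names are hypothetical placeholders, not the authors' code. Threads are sufficient here because each worker only waits on an external Ghidra process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed install location; adjust to the local Ghidra installation.
GHIDRA_HEADLESS = "/opt/ghidra/support/analyzeHeadless"
# Hypothetical Ghidra script that writes out the opcodes of the imported binary.
POST_SCRIPT = "ExtractOpcodes.py"


def build_command(sample, project_root):
    """Build an analyzeHeadless command line for a single sample.

    Each sample gets its own temporary project so that concurrent
    workers do not contend for the same project lock.
    """
    return [
        GHIDRA_HEADLESS,
        str(project_root),
        f"proj_{sample.stem}",
        "-import", str(sample),
        "-postScript", POST_SCRIPT,
        "-deleteProject",
    ]


def analyze(sample, project_root):
    """Run Ghidra on one sample and report its exit code."""
    result = subprocess.run(
        build_command(sample, project_root), capture_output=True, text=True
    )
    return sample.name, result.returncode


def analyze_all(samples, project_root, workers=4, runner=analyze):
    """Analyze many samples concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: runner(s, project_root), samples))
```

A call such as `analyze_all(sorted(Path("samples").glob("*")), Path("/tmp/ghidra_projects"), workers=8)` would process a directory of binaries. The `runner` parameter exists only so the scheduling logic can be exercised without Ghidra installed. Note that samples sharing a file stem would map to the same project name in this sketch.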
