Scalable APT Malware Classification via Parallel Feature Extraction and GPU-Accelerated Learning

Noah Subedar, Taeui Kim, Saathwick Venkataramalingam

TL;DR

The study tackles scalable APT malware classification by leveraging opcode sequences extracted from binaries via Ghidra in headless mode, combined with parallel processing and GPU-accelerated learning. It introduces an automated pipeline that builds 1-gram and 2-gram datasets, applies variance-based feature selection, and tests traditional classifiers (SVM, KNN, Decision Tree) alongside a CNN model inspired by Deep Android Malware Detection. Metadata integration enhances classifier performance, with Decision Trees achieving top results on opcode-plus-metadata configurations and CNNs excelling when operating on opcode sequences alone. The results demonstrate feasible, high-throughput opcode-based malware attribution and highlight the trade-offs between traditional ML and deep learning approaches, particularly in the presence or absence of contextual metadata, with GPU acceleration enabling practical training times for large-scale datasets.
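The n-gram construction and variance-based feature selection mentioned above can be illustrated with a minimal sketch. It assumes each sample is already available as a list of opcode mnemonics, that features are relative n-gram frequencies, and that selection keeps features whose variance across samples exceeds a threshold. The function names, the toy sequences, and the threshold are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from statistics import pvariance

def ngram_counts(opcodes, n):
    """Count contiguous n-grams in one sample's opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

def build_matrix(samples, n):
    """Return (vocabulary, rows); each row holds relative n-gram frequencies."""
    counts = [ngram_counts(s, n) for s in samples]
    vocab = sorted(set().union(*counts))
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # guard against sequences shorter than n
        rows.append([c[g] / total for g in vocab])
    return vocab, rows

def select_by_variance(vocab, rows, threshold):
    """Keep only features whose variance across samples exceeds threshold."""
    keep = [j for j in range(len(vocab))
            if pvariance([r[j] for r in rows]) > threshold]
    return [vocab[j] for j in keep], [[r[j] for j in keep] for r in rows]

# Toy opcode sequences (hypothetical, not from the paper's dataset).
samples = [
    ["push", "mov", "call", "ret"],
    ["push", "mov", "xor", "ret"],
    ["push", "mov", "jmp", "ret"],
]
vocab, rows = build_matrix(samples, n=2)
kept_vocab, kept_rows = select_by_variance(vocab, rows, threshold=0.0)
```

In this toy data the 2-gram `("push", "mov")` occurs with the same frequency in every sample, so it carries no discriminative signal and is dropped, while the sample-specific 2-grams are retained.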

Abstract

This paper presents a framework for automating and accelerating malware classification, specifically the mapping of malicious executables to known Advanced Persistent Threat (APT) groups. The main features used in this analysis are the assembly-level instructions present in executables, also known as opcodes. Collecting opcodes from many malicious samples is a lengthy process; hence, open-source reverse engineering tools are used in tandem with scripts that leverage parallel computing to analyze multiple files at once. Both traditional machine learning and deep learning approaches are applied to build models capable of classifying malware samples. One-gram and two-gram datasets are constructed and used to train models such as SVM, KNN, and Decision Tree; however, these models struggle to provide adequate results without relying on metadata to support the n-gram sequences. The computational limitations of such models are overcome with convolutional neural networks (CNNs), whose training is heavily accelerated using graphics processing unit (GPU) resources.
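As a rough illustration of analyzing multiple files at once, the sketch below fans a per-sample job out over a worker pool. It is an assumption about how such a script could be organized, not the authors' implementation: the post-script name `DumpOpcodes.py`, the per-sample project naming, and both function names are hypothetical. Only the `analyzeHeadless` launcher and its `-import`, `-postScript`, and `-deleteProject` options are standard Ghidra features.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def headless_command(ghidra_home, project_dir, sample, script="DumpOpcodes.py"):
    """Build an analyzeHeadless command for one sample (script name is hypothetical)."""
    sample = Path(sample)
    return [
        str(Path(ghidra_home) / "support" / "analyzeHeadless"),
        str(project_dir),
        f"proj_{sample.stem}",  # assumed: one throwaway project per sample
        "-import", str(sample),
        "-postScript", script,
        "-deleteProject",
    ]

def extract_all(samples, worker, max_workers=4):
    """Apply worker to every sample concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, samples))
```

A worker would typically pass the built command to `subprocess.run`. Threads are sufficient in that setup because each job spends its time waiting on an external Ghidra process rather than executing Python code.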

Paper Structure

This paper contains 74 sections, 54 figures, 14 tables.

Figures (54)

  • Figure 1: One-to-Many Relationship Between Software and APT Groups
  • Figure 2: One-to-One Relationship Between Software and APT Groups
  • Figure 3: Opcode Vocabulary Mapping
  • Figure 4: CNN Sequence
  • Figure 5: Metric Comparison With One-To-Many Dataset
  • ...and 49 more figures