Multimodal Techniques for Malware Classification
Jonathan Jiang, Mark Stamp
TL;DR
This work addresses malware classification of Windows PE files by treating the headers, the remaining sections, and the full file as separate modalities. It trains SVM, LSTM, and CNN models on each modality, then builds multimodal classifiers by feeding the output probabilities of the component models as features into an SVM, achieving higher accuracy than any single-modality model. Multimodal fusion, particularly combinations such as LSTM with CNN or CNN with CNN feeding an SVM, yields about 0.993 accuracy on a 2114-sample, five-class dataset, outperforming the baselines by roughly 1 percentage point. The results suggest that treating the parts of a PE file as distinct data sources and fusing their probability vectors can exploit the complementary strengths of different learning paradigms.
Abstract
The threat of malware is a serious concern for computer networks and systems, highlighting the need for accurate classification techniques. In this research, we experiment with multimodal machine learning approaches for malware classification, based on the structured nature of the Windows Portable Executable (PE) file format. Specifically, we train Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) models on features extracted from PE headers, train the same models on features extracted from the other sections of PE files, and train each model on features extracted from the entire PE file. We then train an SVM model on each of the nine header-sections combinations of these baseline models, using the output layer probabilities of the component models as feature vectors. We compare the baseline cases to these multimodal combinations. In our experiments, we find that the best of the multimodal models outperforms the best of the baseline cases, indicating that it can be advantageous to train separate models on distinct parts of Windows PE files.
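The fusion step described above can be illustrated with a minimal Python sketch. It only shows how the nine header-sections pairings arise and how two probability vectors might be joined into one feature vector for the fusion SVM. Concatenation is an assumption here, since the abstract does not say how the two vectors are combined. The function names and probability values are hypothetical, and training the SVM itself is omitted.

```python
from itertools import product

# The three model families named in the abstract.
MODEL_TYPES = ("SVM", "LSTM", "CNN")


def header_section_pairs(model_types=MODEL_TYPES):
    """Return every (header model, sections model) pairing: 3 x 3 = 9."""
    return list(product(model_types, repeat=2))


def fuse(header_probs, section_probs):
    """Join two per-class probability vectors into one feature vector.

    Assumption: simple concatenation; the source does not specify the scheme.
    """
    if len(header_probs) != len(section_probs):
        raise ValueError("both models must score the same set of classes")
    return list(header_probs) + list(section_probs)


# Hypothetical output probabilities for one sample over five classes.
header_probs = [0.70, 0.10, 0.10, 0.05, 0.05]
section_probs = [0.20, 0.60, 0.10, 0.05, 0.05]

pairs = header_section_pairs()
features = fuse(header_probs, section_probs)

print(len(pairs))     # number of header-sections combinations
print(len(features))  # length of the fused feature vector
```

Under these assumptions, each sample yields one fused vector per pairing, and a separate SVM would be trained on the fused vectors for each of the nine pairings.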
