OpCode-Based Malware Classification Using Machine Learning and Deep Learning Techniques

Varij Saini, Rudraksh Gupta, Neel Soni

TL;DR

The paper compares two approaches to OpCode-sequence-based malware classification on a labeled dataset of APT families: traditional ML with 1-gram/2-gram features, and a CNN that learns from raw OpCodes. A linear SVM achieves the best accuracy among the traditional models (66.37%, with an F1-score of 64.04%), while the CNN attains 62.14% accuracy, showing the potential of automated feature learning without surpassing the SVM on this dataset. The study discusses practical implications for threat intelligence and deployment, highlighting the trade-offs between handcrafted features and deep learning. It also outlines a roadmap for future work, including dynamic analysis, larger datasets, advanced architectures such as Transformer models, transfer learning, and explainability to improve robustness and trust.

Abstract

This technical report presents a comprehensive analysis of malware classification using OpCode sequences. Two distinct approaches are evaluated: traditional machine learning using n-gram analysis with Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree classifiers; and a deep learning approach employing a Convolutional Neural Network (CNN). The traditional machine learning approach establishes a baseline using handcrafted 1-gram and 2-gram features from disassembled malware samples. The deep learning methodology builds upon the work proposed in "Deep Android Malware Detection" by McLaughlin et al. and evaluates the performance of a CNN model trained to automatically extract features from raw OpCode data. Empirical results are compared using standard performance metrics (accuracy, precision, recall, and F1-score). While the SVM classifier outperforms other traditional techniques, the CNN model demonstrates competitive performance with the added benefit of automated feature extraction.
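To make the handcrafted-feature baseline concrete, the following is a minimal sketch of how 1-gram and 2-gram counts could be derived from a single disassembled OpCode sequence. The function names and the sample mnemonics are hypothetical, chosen only for illustration; the report's actual extraction pipeline and dataset are not shown here.

```python
from collections import Counter
from typing import List, Sequence


def opcode_ngrams(opcodes: List[str], n: int) -> Counter:
    """Count contiguous n-grams in one opcode sequence.

    Each n-gram is stored as a tuple of mnemonics. A sequence shorter
    than n yields an empty Counter.
    """
    return Counter(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )


def ngram_features(opcodes: List[str], orders: Sequence[int] = (1, 2)) -> Counter:
    """Merge n-gram counts for several orders into one sparse feature map."""
    counts: Counter = Counter()
    for n in orders:
        counts.update(opcode_ngrams(opcodes, n))
    return counts


# Hypothetical opcode sequence, not taken from the paper's dataset.
sample = ["push", "mov", "push", "mov", "call"]
features = ngram_features(sample)

for gram, count in sorted(features.items()):
    print(" ".join(gram), count)
```

A count map like this would typically be turned into a fixed-length vector over a shared n-gram vocabulary before being passed to a classifier such as an SVM, KNN, or Decision Tree.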

Paper Structure

This paper contains 17 sections.