Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy

Bishwajit Prasad Gond; Rajneekant; Pushkar Kishore; Durga Prasad Mohapatra

TL;DR

This paper explores how NLP can be used to extract and analyze textual features from malware samples through n-grams (contiguous sequences of strings or API calls), and delves into n-gram size selection, feature representation, and classification algorithms.

Abstract

This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach captures distinctive linguistic patterns among malware and benign families, enabling finer-grained classification. We delve into n-gram size selection, feature representation, and classification algorithms. Evaluating the proposed method on real-world malware samples, we observe significantly improved accuracy compared to traditional methods. Using our n-gram approach together with a hybrid feature selection technique to address high dimensionality, we achieve an accuracy of 99.02% across various machine learning algorithms. The hybrid feature selection technique reduces the feature set to only 1.6% of the original features.
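
To make the notion of n-grams over API call sequences concrete, the following is a minimal sketch and not the authors' implementation. The helper names (`extract_ngrams`, `ngram_counts`) and the example API call trace are hypothetical and chosen only for illustration; the abstract does not specify the actual tokenization, n-gram sizes, or feature representation used.

```python
from collections import Counter


def extract_ngrams(sequence, n):
    """Return every contiguous n-gram (as a tuple) from a list of tokens.

    Hypothetical helper: yields an empty list when the sequence is
    shorter than n.
    """
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def ngram_counts(sequence, n):
    """Count how often each n-gram occurs; one possible feature representation."""
    return Counter(extract_ngrams(sequence, n))


# Illustrative API call trace (assumed, not taken from the paper's dataset).
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]

# Bigram (n = 2) frequencies over the trace.
bigrams = ngram_counts(trace, 2)
```

Counts like these could serve as one column per n-gram in a feature matrix. Because the number of distinct n-grams grows quickly with n, a feature selection step such as the hybrid technique mentioned in the abstract would be needed before classification.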
