Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy

Bishwajit Prasad Gond, Rajneekant, Pushkar Kishore, Durga Prasad Mohapatra

TL;DR

This paper explores how NLP can be used to extract and analyze textual features from malware samples as n-grams (contiguous sequences of strings or API calls), and examines n-gram size selection, feature representation, and classification algorithms.

Abstract

This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples as n-grams, that is, contiguous sequences of strings or API calls. This approach captures distinctive linguistic patterns among malware and benign families, enabling finer-grained classification. We examine n-gram size selection, feature representation, and classification algorithms. Evaluating the proposed method on real-world malware samples, we observe significantly improved accuracy compared to traditional methods. With our n-gram approach, we achieved an accuracy of 99.02% across various machine learning algorithms, using a hybrid feature selection technique to address high dimensionality. The hybrid feature selection technique reduces the feature set to only 1.6% of the original features.
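To make the n-gram idea concrete, the following is a minimal sketch of how contiguous n-grams might be extracted from API call sequences and turned into count vectors. It is not the authors' pipeline: the function names (`extract_ngrams`, `build_vocabulary`, `vectorize`), the example API traces, and the choice of raw counts as the feature representation are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def extract_ngrams(calls: Sequence[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous n-gram of one API call sequence."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def build_vocabulary(samples: Sequence[Sequence[str]], n: int) -> list[tuple[str, ...]]:
    """Collect the sorted set of n-grams seen across all samples."""
    return sorted({gram for calls in samples for gram in extract_ngrams(calls, n)})


def vectorize(calls: Sequence[str], vocab: Sequence[tuple[str, ...]], n: int) -> list[int]:
    """Map one sample to a vector of n-gram counts over the vocabulary."""
    counts = Counter(extract_ngrams(calls, n))
    return [counts[gram] for gram in vocab]


# Hypothetical API call traces, invented for illustration only.
sample_a = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
sample_b = ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"]

vocab = build_vocabulary([sample_a, sample_b], n=2)
vector_a = vectorize(sample_a, vocab, n=2)
vector_b = vectorize(sample_b, vocab, n=2)
```

The vocabulary grows quickly with n and with the number of distinct API calls, which is the high-dimensionality problem the abstract says feature selection is meant to address.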

Paper Structure

This paper contains 11 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Proposed Architecture for Malware Classification
  • Figure 2: Data Preprocessing and Feature Engineering
  • Figure 3: Confusion Matrices for Malware Classification Using ML Techniques
  • Figure 4
  • Figure 5