Semantic Preprocessing for LLM-based Malware Analysis

Benjamin Marais, Tony Quertier, Grégoire Barrue

TL;DR

This work tackles the challenge of data representation in AI-based malware analysis by introducing a semantic preprocessing pipeline that builds detailed JSON reports for Portable Executable (PE) files. The reports combine static features with packer signals and knowledge mappings from MITRE ATT&CK and the Malware Behavior Catalog (MBC) to improve interpretability. Built from Import Address Table (IAT), section-level, and packing information, they enable transformer-based classifiers (BERT and ModernBERT) to perform eight-way malware category classification on a realistic, imbalanced dataset with a weighted F1 of about 0.94. The approach shows strong performance and potential for explainability, but it also exposes limitations (e.g., on droppers and underrepresented classes) and points to enhancements such as integrating dynamic analysis and providing explanations through attention-based summaries. Overall, the semantic, expert-informed preprocessing offers a practical, modular foundation for AI-assisted malware triage and classification.

Abstract

In malware analysis, numerous approaches rely on Artificial Intelligence to handle large volumes of data. However, these techniques focus on a data view (images, sequences) rather than an expert's view. To address this, we propose a new preprocessing method that focuses on expert knowledge to improve semantic malware analysis and the interpretability of results. The method creates JSON reports for Portable Executable files. These reports gather features from both static and behavioral analysis, and incorporate packer signature detection, MITRE ATT&CK, and Malware Behavior Catalog (MBC) knowledge. The purpose of this preprocessing is to build a semantic representation of binary files that malware analysts can understand and that can enhance the explainability of AI models for malicious file analysis. Using this preprocessing to train a Large Language Model for malware classification, we achieve a weighted-average F1-score of 0.94 on a complex dataset representative of market reality.
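As a reading aid for the reported metric, the following is a minimal sketch of how a weighted-average F1-score is commonly computed: the per-class F1 is averaged with weights proportional to each class's share of the true labels, which matters on an imbalanced dataset. The function name `weighted_f1` and the class labels in the usage note are hypothetical and do not come from the paper; the authors' actual evaluation code is not shown here.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by class support in y_true.

    Illustrative helper (hypothetical name), not the paper's evaluation code.
    """
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        # Number of samples predicted as this class (for precision).
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Weight each class's F1 by its share of the true labels.
        score += (n_true / total) * f1
    return score
```

For example, with the hypothetical labels `y_true = ["trojan", "trojan", "trojan", "worm"]` and `y_pred = ["trojan", "trojan", "worm", "worm"]`, the per-class F1 values are 0.8 and about 0.667, and the support weights are 0.75 and 0.25, giving a weighted F1 of about 0.767.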

Paper Structure

This paper contains 11 sections, 1 equation, 13 figures, and 10 tables.

Figures (13)

  • Figure 1: Simplified architecture of a PE file
  • Figure 2: Global file information
  • Figure 3: File section information
  • Figure 4: IAT information
  • Figure 5: Packing signatures
  • ...and 8 more figures