Use of Multi-CNNs for Section Analysis in Static Malware Detection

Tony Quertier; Grégoire Barrué

TL;DR

The paper tackles static malware detection with explainability by analyzing Portable Executable files at the section level. It introduces a distributed framework that converts each PE section into a $64 \times 64$ grayscale image, trains a CNN per section, and learns a final classification score from the per-section outputs using XGBoost or Random Forest. The approach yields a modest accuracy gain of about $1.5\%$ over a single full-file CNN and provides interpretable insights into which sections drive decisions via permutation importance and Mean Decrease in Impurity analyses, notably highlighting the .idata and .rsrc sections. This method offers actionable, scalable guidance for analysts and can be extended by adding more PE sections to further improve performance and explainability.

Abstract

Existing research on malware detection focuses almost exclusively on the detection rate. However, in some cases, it is also important to understand the results of the algorithm, or to obtain more information, such as where in the file an analyst should investigate. To this end, we propose a new model to analyze Portable Executable files. Our method consists of splitting each file into its sections, then transforming each section into an image, in order to train convolutional neural networks (CNNs) that each handle one specific section. We then use the scores returned by the CNNs to compute a final detection score, using models that let us analyze the importance of each section in the final score.
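The section-to-image step can be illustrated with a minimal sketch. The abstract does not specify how section bytes are mapped to pixels, so the nearest-neighbour resampling below is an assumption, and `section_to_image` is a hypothetical helper name rather than the authors' code. It only shows one plausible way to turn a variable-length byte string into a fixed $64 \times 64$ grid of grayscale values.

```python
from typing import List


def section_to_image(data: bytes, side: int = 64) -> List[List[int]]:
    """Map raw section bytes to a side x side grid of grayscale values (0-255).

    Assumption: the byte stream is resampled by nearest-neighbour indexing so
    that sections of any length yield an image of the same size.
    """
    n_pixels = side * side
    if not data:
        # An absent or empty section becomes an all-black image.
        return [[0] * side for _ in range(side)]
    # Pick n_pixels evenly spaced bytes (repeating bytes if the section is short).
    pixels = [data[(i * len(data)) // n_pixels] for i in range(n_pixels)]
    # Cut the flat pixel list into rows of length `side`.
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# Usage with made-up bytes standing in for one section of a PE file.
image = section_to_image(bytes(range(256)) * 16)
print(len(image), len(image[0]))
```

Each resulting grid would then be fed to the CNN dedicated to that section; an empty section still produces a valid input, so every file can yield a score vector of the same length.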

Paper Structure (7 sections, 5 figures, 3 tables)

This paper contains 7 sections, 5 figures, 3 tables.

Dataset and preprocessing
Dataset
PE file format
Splitting the image into sub-images
Framework
Experiments and results
Conclusion and future work

Figures (5)

Figure 1: Sections in a malware
Figure 2: Images of the different sections of a binary
Figure 3: Architecture of our algorithm. We start from a binary file and decompose it into images of its sections. Each section is used to train a specific CNN; we then gather the CNN scores into a vector used as input to train a scoring function. Once everything is trained, we get a model which takes the binary as input and gives its predicted label as output.
Figure 4: Mean Decrease in Impurity for RF and XGBoost models
Figure 5: Permutation feature importance for XGBoost and RF models
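Permutation feature importance, as referenced in Figure 5, can be sketched in a few lines. This is a generic illustration of the technique, not the authors' implementation: the paper applies it to XGBoost and Random Forest models trained on per-section CNN scores, whereas the `toy_predict` rule and the score vectors below are synthetic stand-ins chosen only to make the example self-contained.

```python
import random
from typing import Callable, List


def accuracy(predict: Callable[[List[float]], int],
             X: List[List[float]], y: List[int]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict: Callable[[List[float]], int],
                           X: List[List[float]], y: List[int],
                           n_repeats: int = 10, seed: int = 0) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link with the labels.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic score vectors: one score per section CNN (three sections here).
X = [[0.2, 0.9, 0.5], [0.7, 0.1, 0.4], [0.6, 0.8, 0.3], [0.3, 0.2, 0.6],
     [0.5, 0.9, 0.1], [0.4, 0.1, 0.9], [0.8, 0.7, 0.2], [0.1, 0.3, 0.7]]


def toy_predict(row: List[float]) -> int:
    """Hypothetical final scorer that only looks at the second section score."""
    return int(row[1] > 0.5)


y = [toy_predict(row) for row in X]
importances = permutation_importance(toy_predict, X, y)
# Columns 0 and 2 get exactly 0.0 because the toy rule ignores them.
print(importances)
```

A section whose score column can be shuffled without hurting accuracy contributes little to the final decision, which is the kind of signal an analyst can use to decide where in the file to look first.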
