Image-Based Malware Classification Using QR and Aztec Codes

Atharva Khadilkar, Mark Stamp

TL;DR

This work investigates image-based malware classification by encoding executable-derived features into QR and Aztec codes and applying CNNs to the resulting images. Using two large datasets, CIC-MalMem-2022 and BODMAS, the authors show that QR/Aztec CNNs dramatically outperform traditional models on the memory-dump-driven CIC-MalMem-2022 dataset, achieving near-perfect multiclass accuracy, while on the static-feature-heavy BODMAS dataset classic ML methods remain superior. The study highlights the promise of barcode-based feature representations for challenging, obfuscated malware, and underscores the need for further research both to understand dataset-dependent effects and to mitigate overfitting in image-based approaches. The findings suggest that QR/Aztec code representations can be a valuable tool in malware analysis, particularly for dynamic, obfuscated samples, and they motivate broader evaluations and improved encoding strategies across more datasets.

Abstract

In recent years, the use of image-based techniques for malware detection has gained prominence, with numerous studies demonstrating the efficacy of deep learning approaches such as Convolutional Neural Networks (CNNs) in classifying images derived from executable files. In this paper, we consider an innovative method that transforms features extracted from executable files into QR and Aztec codes. These codes capture structural patterns in a format that may enhance the learning capabilities of CNNs. We design and implement CNN architectures tailored to the unique properties of these codes and apply them to a comprehensive analysis involving two extensive malware datasets, both of which include a significant corpus of benign samples. Our results yield a split decision, with CNNs trained on QR and Aztec codes outperforming the state of the art on one of the datasets, but underperforming more typical techniques on the other dataset. These results indicate that the use of QR and Aztec codes as a form of feature engineering holds considerable promise in the malware domain, and that additional research is needed to better understand the relative strengths and weaknesses of such an approach.
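
As a rough illustration of the conversion step described in the abstract, the sketch below serializes a numeric feature vector into a text payload of the kind a QR or Aztec encoder could render as an image. The abstract does not specify the serialization format or the encoder, so the comma-separated format, the rounding precision, the helper names `features_to_payload` and `fits_in_qr`, and the example values are all assumptions made for illustration. The sketch stops at the payload and only checks it against the byte-mode capacity of the largest QR symbol (version 40, error-correction level L).

```python
# Hypothetical sketch: the serialization format is an assumption,
# not the encoding used by the paper's authors.

# Byte-mode capacity of a version 40 QR code at error-correction level L.
QR_V40_L_MAX_BYTES = 2953


def features_to_payload(features, precision=4):
    """Serialize a numeric feature vector into a comma-separated string.

    Each value is rounded to `precision` significant digits to keep
    the payload short enough for a single barcode.
    """
    return ",".join(f"{value:.{precision}g}" for value in features)


def fits_in_qr(payload, max_bytes=QR_V40_L_MAX_BYTES):
    """Return True if the UTF-8 encoded payload fits within `max_bytes`."""
    return len(payload.encode("utf-8")) <= max_bytes


# Made-up feature values, used only to exercise the helpers.
example_features = [45.0, 17.0, 10.5556, 0.0, 202.8447]
payload = features_to_payload(example_features)
print(payload)
print(fits_in_qr(payload))
```

The resulting string would then be passed to a barcode encoder of choice to produce the image a CNN consumes. Lowering the precision reduces the payload size, at the cost of discarding some feature detail.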

Paper Structure

This paper contains 26 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: CIC-MalMem-2022 class distribution
  • Figure 2: Benign and BODMAS class distribution
  • Figure 3: CNN architectures
  • Figure 4: