Model X-Ray: Detection of Hidden Malware in AI Model Weights using Few Shot Learning

Daniel Gilkarov, Ran Dubin

TL;DR

This work tackles the threat of malware hidden in AI model weights through steganography by representing AI models as images using the Grayscale-Fourpart representation and applying few-shot learning with CNN embeddings. A triplet-based convolutional embedding network, evaluated with Centroid and KNN classifiers, achieves robust detection with only six training models and maintains performance across unseen architectures and payloads, including out-of-distribution (OOD) MaleficNet attacks. The approach detects attacks with embedding rates of $ER \le 25\%$, and as low as $6\%$ in some cases, outperforming prior baselines that required orders of magnitude more training data. The method is accompanied by open-source tooling and is validated across a range of model sizes and tasks, offering a scalable security layer for model repositories and end users.

Abstract

The potential for exploitation of AI models has increased due to the rapid advancement of Artificial Intelligence (AI) and the widespread use of platforms like Model Zoo for sharing AI models. Attackers can embed malware within AI models through steganographic techniques, taking advantage of the substantial size of these models to conceal malicious data and use it for nefarious purposes, e.g., remote code execution. Ensuring the security of AI models is a burgeoning area of research, essential for safeguarding the many organizations and users relying on AI technologies. This study leverages well-studied image few-shot learning techniques by transferring AI models to the image domain using a novel image representation. Applying few-shot learning in this domain enables us to create practical detection models, something previous work has not achieved. Our method addresses critical limitations in state-of-the-art detection techniques that hinder their practicality. This approach reduces the required training dataset size from 40,000 models to just 6. Furthermore, our methods consistently detect subtle attacks with embedding rates as low as 25%, and as low as 6% in some cases, whereas previous work was shown to be effective only for embedding rates of 50% to 100%. We employ a strict evaluation strategy to ensure the trained models generalize across various factors. In addition, we show that our trained models successfully detect novel spread-spectrum steganography attacks, demonstrating robustness even though the models were trained on only one type of attack. We open-source our code to support reproducibility and advance research in this new field.
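
To make the threat model concrete, here is a minimal sketch of least-significant-bit embedding in float32 weights. It is not the authors' implementation: the function name `x_lsb_attack_fill`, the MSB-first bit order, and the repeat-until-full behavior are assumptions chosen for illustration.

```python
import numpy as np

def x_lsb_attack_fill(weights: np.ndarray, payload: bytes, x: int) -> np.ndarray:
    """Overwrite the x least significant bits of every float32 weight with
    payload bits, repeating the payload until all weights are used.
    Hypothetical sketch; the bit order and fill policy are assumptions."""
    if not 1 <= x <= 32:
        raise ValueError("x must be between 1 and 32 for float32 weights")
    if len(payload) == 0:
        raise ValueError("payload must not be empty")

    # Reinterpret the float32 values as raw 32-bit integers (on a copy).
    flat = np.ascontiguousarray(weights, dtype=np.float32).ravel()
    raw = flat.view(np.uint32).copy()

    # Unpack the payload into bits and tile it to cover x bits per weight.
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    needed = raw.size * x
    reps = -(-needed // bits.size)  # ceiling division
    bits = np.tile(bits, reps)[:needed].reshape(-1, x)

    # Pack each row of x bits into one integer, most significant bit first.
    powers = (1 << np.arange(x - 1, -1, -1)).astype(np.uint32)
    chunks = (bits.astype(np.uint32) * powers).sum(axis=1).astype(np.uint32)

    # Clear the low x bits of each weight and write the payload chunk there.
    mask = np.uint32((1 << x) - 1)
    attacked = (raw & ~mask) | chunks
    return attacked.view(np.float32).reshape(weights.shape)

# Toy usage: hide one repeated byte in the 4 lowest bits of six weights.
clean = np.ones((2, 3), dtype=np.float32)
stego = x_lsb_attack_fill(clean, b"\xa5", x=4)
```

With a small `x`, only low-order mantissa bits change, so the attacked weights stay numerically close to the originals, which is why such payloads are hard to spot by inspecting model behavior alone.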

Paper Structure (40 sections, 7 equations, 8 figures, 4 tables, 3 algorithms)

Figures (8)

  • Figure 1: Overall dataset creation process (Section \ref{sec:dataset_creation}). First, we obtain pre-trained models from model repositories and group them into model zoos by model size; we call these groups Model Collections (MCs). In this example, $MC_1$ contains 2 CNNs and $MC_2$ contains 3 different DNNs. Second, we create an attacked version of each model collection $MC_i$ by applying X-LSB-Attack-Fill (Section \ref{sec:x_lsb_attack}) with some value $X_i$ and malware payload $m_i$. Finally, the original and attacked MCs (colored blue and red, respectively) go through a pre-processing phase: in this example, we compute a model image representation $I_i$ (Section \ref{sec:image_creation}) and reshape all resulting images to shape $sh_i$. The process yields 2 datasets of benign and malicious model image representations, created from $MC_1$ and $MC_2$.
  • Figure 2: Illustration of classifying a new sample (colored yellow, see (2)) using the trained CNN (3a) with the centroid method. We compute the embeddings of our benign and malicious training samples (colored blue and red, see (1a) and (1b)) and average them to get the centroid embeddings (4). Then, we classify the new sample by measuring the $l_2$ distance from its embedding to both centroid embeddings and assigning it the label of the closer one (5). In this example, the new sample is labeled benign because its distance to the benign centroid (0.27) is smaller than its distance to the malicious centroid (0.56). This procedure is essentially the same as applying 1-Nearest-Neighbor with $l_2$ distance to the centroids' embeddings.
  • Figure 3: Experiment 1 OML results (see Section \ref{sec:plot_types}). We train OSL CNN models on 6 samples from the SCZ dataset. The baseline result from prior work \cite{gilkarov2023steganalysis} is plotted as a dotted green line.
  • Figure 4: Experiment 2 OML ID results. FSL models trained on the famous small CNNs train set and tested on the small CNNs test set.
  • Figure 5: Experiment 2 AL ID results. FSL models trained on the famous small CNNs train set and tested on the small CNNs test set.
  • ...and 3 more figures
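
The centroid classification described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the embeddings have already been produced by a trained CNN, and the helper names `fit_centroids` and `classify` as well as the toy 2-D embedding values are hypothetical.

```python
import numpy as np

def fit_centroids(benign_emb: np.ndarray, malicious_emb: np.ndarray) -> dict:
    """Average the training embeddings of each class into one centroid."""
    return {
        "benign": np.mean(benign_emb, axis=0),
        "malicious": np.mean(malicious_emb, axis=0),
    }

def classify(sample_emb: np.ndarray, centroids: dict) -> tuple:
    """Return the label of the nearest centroid under l2 distance,
    together with the distance to every centroid."""
    dists = {
        label: float(np.linalg.norm(sample_emb - centroid))
        for label, centroid in centroids.items()
    }
    return min(dists, key=dists.get), dists

# Toy 2-D embeddings standing in for CNN outputs (made-up values).
benign_emb = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.3]])
malicious_emb = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])

centroids = fit_centroids(benign_emb, malicious_emb)
label, dists = classify(np.array([0.2, 0.2]), centroids)
```

Because each class is reduced to a single point, classification costs two distance computations regardless of how many training samples were used, and it coincides with 1-Nearest-Neighbor over the two centroids.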