Vera Verto: Multimodal Hijacking Attack

Minxing Zhang, Ahmed Salem, Michael Backes, Yang Zhang

TL;DR

This work extends model hijacking attacks from the single-modality (vision) setting to a multimodal one, in which an NLP hijacking task is executed inside a CV classifier. It introduces the Blender, an encoder-decoder framework that fuses hijacking text with container images so that the victim model keeps performing its original CV task while also serving the hijacking objective. In experiments on MNIST, CIFAR-10, and STL-10 with the NLP datasets Yelp and Sogou, the attack reaches success rates of up to about 95% with minimal loss in the victim model’s utility, and it remains effective across architectures and datasets. The study also analyzes defense mechanisms and discusses practical considerations such as stealthiness, compute cost, and reuse of the Blender across hijacking settings, underscoring the need for mitigations in real-world training pipelines.

Abstract

The increasing cost of training machine learning (ML) models has led to the inclusion of new parties in the training pipeline, such as users who contribute training data and companies that provide computing resources. The involvement of these parties has introduced new attack surfaces for an adversary to exploit. A recent attack in this domain is the model hijacking attack, whereby an adversary hijacks a victim model to implement their own (possibly malicious) hijacking tasks. However, the scope of the model hijacking attack has so far been limited to homogeneous-modality tasks. In this paper, we transform the model hijacking attack into a more general multimodal setting, where the hijacking and original tasks are performed on data of different modalities. Specifically, we focus on the setting where an adversary implements a natural language processing (NLP) hijacking task in an image classification model. To mount the attack, we propose a novel encoder-decoder based framework, namely the Blender, which relies on advanced image and language models. Experimental results show that our multimodal hijacking attack achieves strong performance in different settings. For instance, our attack achieves 94%, 94%, and 95% attack success rates when using the Sogou news dataset to hijack STL-10, CIFAR-10, and MNIST classifiers, respectively.
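
To make the two headline metrics concrete, the following is a minimal sketch and not the authors' evaluation code. It assumes that utility is the hijacked model's accuracy on clean original-task samples, and that the attack success rate is its accuracy on fused samples once each predicted class index is mapped back to a hijacking label. The function names, the `label_map` dictionary, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Sequence


def accuracy(predictions: Sequence[int], targets: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the target class."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("cannot score an empty evaluation set")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def attack_success_rate(
    predicted_classes: Sequence[int],
    hijack_labels: Sequence[str],
    label_map: Dict[int, str],
) -> float:
    """Accuracy on the hijacking task after mapping victim classes to hijacking labels.

    Assumption: the adversary keeps a private mapping from the victim model's
    class indices to hijacking labels. A prediction with no mapped label
    counts as a failed attack.
    """
    if len(predicted_classes) != len(hijack_labels):
        raise ValueError("predictions and labels must have the same length")
    if len(hijack_labels) == 0:
        raise ValueError("cannot score an empty evaluation set")
    recovered: list[Optional[str]] = [label_map.get(c) for c in predicted_classes]
    return sum(r == h for r, h in zip(recovered, hijack_labels)) / len(hijack_labels)


# Hypothetical mapping: five review-star labels hosted on five image classes.
label_map = {0: "1-star", 1: "2-star", 2: "3-star", 3: "4-star", 4: "5-star"}

# Toy predictions of a hijacked classifier on clean images (original task).
clean_predictions = [7, 3, 3, 9]
clean_targets = [7, 3, 5, 9]
utility = accuracy(clean_predictions, clean_targets)

# Toy predictions of the same classifier on fused images (hijacking task).
fused_predictions = [3, 0, 4, 2, 8]
fused_hijack_labels = ["4-star", "1-star", "5-star", "2-star", "3-star"]
asr = attack_success_rate(fused_predictions, fused_hijack_labels, label_map)

print(f"utility={utility:.2f}, attack success rate={asr:.2f}")
```

Under these assumptions, a successful attack keeps `utility` close to that of a clean model while pushing `asr` high; the toy numbers here carry no experimental meaning.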

Paper Structure

This paper contains 22 sections, 3 equations, 23 figures, 2 tables.

Figures (23)

  • Figure 1: An overview of the multimodal hijacking attack. First, the Blender takes one sample from the hijacking dataset and one from the container dataset. It then fuses the two inputs into an image that looks like the container image but carries the features of the hijacking text input. The model can perform both the original classification task (classifying the image as a horse) and the hijacking task, i.e., classifying the fused image as 4-star (the label of the hijacking input).
  • Figure 2: Performance of our multimodal hijacking attack, where the notation x_y denotes the hijacking dataset x paired with the original dataset y.
  • Figure 3: Performance of our multimodal hijacking attack on different hijacking datasets, where Tiny ImageNet is the original dataset and MobileNetV2 is the target model. The low utility of the clean models (27%) is due to the large number of labels (1,000) and the limited number of samples in the dataset.
  • Figure 4: Comparison of our multimodal hijacking attack's performance when using the complete embeddings of the hijacking sentence versus only the last ("[cls]") token. The hijacking dataset is Yelp and the original dataset is CIFAR-10.
  • Figure 5: Comparison of our multimodal hijacking attack's performance when using the adapter versus our Blender, where the notation x_y denotes the hijacking dataset x paired with the original dataset y.
  • ...and 18 more figures