Vera Verto: Multimodal Hijacking Attack
Minxing Zhang, Ahmed Salem, Michael Backes, Yang Zhang
TL;DR
This work extends the model hijacking attack from single-modality (vision) settings to multimodal ones, so that NLP hijacking tasks can be executed within CV classifiers. It introduces Blender, an encoder-decoder framework that fuses hijacking text with container images to preserve the original CV task while enabling the hijacking objective. In experiments on CV datasets (MNIST, CIFAR-10, STL-10) and NLP datasets (Yelp, Sogou), the method achieves high attack success rates (up to about 95%) with minimal loss in the victim model’s utility, and it remains effective across architectures and datasets. The study also analyzes defense mechanisms and discusses practical considerations such as stealthiness, compute costs, and the reusability of the Blender across multiple hijacking settings, underscoring the need for mitigations in real-world training pipelines.
Abstract
The increasing cost of training machine learning (ML) models has led to the inclusion of new parties in the training pipeline, such as users who contribute training data and companies that provide computing resources. The involvement of these parties in the ML training process has introduced new attack surfaces for an adversary to exploit. A recent attack in this domain is the model hijacking attack, whereby an adversary hijacks a victim model to implement their own -- possibly malicious -- hijacking tasks. However, the scope of the model hijacking attack has so far been limited to homogeneous-modality tasks. In this paper, we transform the model hijacking attack into a more general multimodal setting, where the hijacking and original tasks are performed on data of different modalities. Specifically, we focus on the setting where an adversary implements a natural language processing (NLP) hijacking task in an image classification model. To mount the attack, we propose a novel encoder-decoder based framework, namely the Blender, which relies on advanced image and language models. Experimental results show that our model hijacking attack achieves strong performance in different settings. For instance, our attack achieves attack success rates of 94%, 94%, and 95% when using the Sogou news dataset to hijack STL-10, CIFAR-10, and MNIST classifiers, respectively.
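The attack success rate reported in the abstract can be read as the accuracy of the hijacked classifier on the hijacking task. The following is a minimal sketch of that metric, assuming that each hijacking (NLP) label is mapped to one class of the original (CV) task; the names `attack_success_rate`, `classify`, and `label_map` are hypothetical and are not taken from the authors' code, and the toy classifier stands in for a real hijacked model.

```python
from typing import Callable, Dict, Sequence


def attack_success_rate(
    classify: Callable[[object], int],
    camouflaged_inputs: Sequence[object],
    hijack_labels: Sequence[str],
    label_map: Dict[str, int],
) -> float:
    """Fraction of hijacking samples whose predicted original-task class
    equals the class assigned to their true hijacking label.

    Assumption: label_map sends every hijacking label to exactly one
    class index of the original classifier.
    """
    if len(camouflaged_inputs) != len(hijack_labels):
        raise ValueError("inputs and labels must have the same length")
    if len(camouflaged_inputs) == 0:
        raise ValueError("need at least one sample")
    hits = sum(
        1
        for x, y in zip(camouflaged_inputs, hijack_labels)
        if classify(x) == label_map[y]
    )
    return hits / len(camouflaged_inputs)


# Toy usage: a binary sentiment task mapped onto classes 0 and 1 of a
# 10-class image classifier. The "classifier" here is a stand-in that
# returns the input modulo 10; it is not a trained model.
toy_label_map = {"negative": 0, "positive": 1}
toy_inputs = [0, 1, 11, 3]
toy_labels = ["negative", "positive", "positive", "positive"]
toy_asr = attack_success_rate(lambda x: x % 10, toy_inputs, toy_labels, toy_label_map)
```

In the toy run, three of the four samples land on the class assigned to their hijacking label. With a real hijacked classifier, `camouflaged_inputs` would be the images produced from the hijacking text, and utility would be measured separately as accuracy on the original task's test set.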
