Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis
Cristian-Alexandru Botocan, Raphael Meier, Ljiljana Dolamic
TL;DR
This paper investigates the robustness of vision-language architectures against pixel-level adversarial perturbations in a black-box setting under $L_{0}$-norm constraints. It introduces Evolutionary Attacks that use Differential Evolution to craft Sparse and Contiguous perturbations applied to preprocessed inputs, and evaluates them on four multimodal models and two unimodal DNNs with ImageNet data. Key findings show that unimodal DNNs are generally more robust than their multimodal counterparts, that ViT-based encoders are especially vulnerable to sparse perturbations, and that CNN-based encoders (e.g., ALIGN) are highly susceptible to patch-based perturbations, with untargeted success rates of up to 99% while perturbing only a minimal image area. The results highlight architecture-specific vulnerabilities and suggest a trade-off between robustness and zero-shot flexibility, informing the secure design of multimodal systems and motivating future work on broader model families and attack hyperparameter studies.
Abstract
Assessing the robustness of multimodal models against adversarial examples is important for the safety of their users. We craft $L_{0}$-norm perturbation attacks on the preprocessed input images. We launch them in a black-box setup against four multimodal models and two unimodal DNNs, considering both targeted and untargeted misclassification. Our attacks perturb less than 0.04% of the image area and use different spatial arrangements of the perturbed pixels: sparse positioning and pixels arranged in contiguous shapes (row, column, diagonal, and patch). To the best of our knowledge, we are the first to assess the robustness of three state-of-the-art multimodal models (ALIGN, AltCLIP, GroupViT) against different sparse and contiguous pixel distribution perturbations. The results indicate that unimodal DNNs are more robust than multimodal models. Furthermore, models using a CNN-based image encoder are more vulnerable than models with a ViT: for untargeted attacks, we obtain a 99% success rate by perturbing less than 0.02% of the image area.
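To make the pixel arrangements named above concrete, the following is a minimal sketch of how the positions for each layout (sparse, row, column, diagonal, patch) could be enumerated, together with the fraction of the image they cover. It is an illustration only, not the authors' implementation: the function names, the 16-pixel budget, and the 224x224 input size are assumptions, and the Differential Evolution search that would choose positions and pixel values is not shown.

```python
import math
import numpy as np


def pixel_positions(layout, n_pixels, height, width, rng):
    """Return an (n, 2) array of (row, col) positions for one layout.

    Hypothetical helper: positions are drawn at random here, whereas an
    attack would let its search procedure pick them.
    """
    if layout == "sparse":
        # n distinct positions anywhere in the image
        flat = rng.choice(height * width, size=n_pixels, replace=False)
        return np.stack(np.unravel_index(flat, (height, width)), axis=1)
    if layout == "patch":
        # square patch; uses side * side pixels (n_pixels should be a square)
        side = math.isqrt(n_pixels)
        r0 = rng.integers(0, height - side + 1)
        c0 = rng.integers(0, width - side + 1)
        rows, cols = np.meshgrid(
            np.arange(r0, r0 + side), np.arange(c0, c0 + side), indexing="ij"
        )
        return np.stack([rows.ravel(), cols.ravel()], axis=1)
    # row, column, diagonal: a contiguous run of n pixels from a random start
    dr, dc = {"row": (0, 1), "column": (1, 0), "diagonal": (1, 1)}[layout]
    r0 = rng.integers(0, height - dr * (n_pixels - 1))
    c0 = rng.integers(0, width - dc * (n_pixels - 1))
    steps = np.arange(n_pixels)
    return np.stack([r0 + dr * steps, c0 + dc * steps], axis=1)


def perturbed_fraction(n_pixels, height, width):
    """Fraction of the image area covered by n perturbed pixels."""
    return n_pixels / (height * width)


# Assumed setting: 16 pixels on a 224x224 input.
rng = np.random.default_rng(0)
layouts = {
    name: pixel_positions(name, 16, 224, 224, rng)
    for name in ["sparse", "row", "column", "diagonal", "patch"]
}
fraction = perturbed_fraction(16, 224, 224)
```

Under these assumed values, 16 pixels out of 224 x 224 = 50176 is about 0.032% of the image area, which is in the regime the abstract describes; the actual pixel budgets and input resolutions depend on each model's preprocessing.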
