Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis

Cristian-Alexandru Botocan, Raphael Meier, Ljiljana Dolamic

TL;DR

This paper investigates the robustness of vision-language architectures against pixel-level adversarial perturbations in a black-box setting under $L_{0}$-norm constraints. It introduces Evolutionary Attacks that use Differential Evolution to craft Sparse and Contiguous perturbations applied to preprocessed inputs, and evaluates them on four multimodal models and two unimodal DNNs with ImageNet data. Key findings show that unimodal DNNs are generally more robust than their multimodal counterparts. ViT-based encoders are especially vulnerable to sparse perturbations, while CNN-based encoders (e.g., ALIGN) are highly susceptible to patch-based perturbations, with untargeted success rates (SR) of up to 99% when less than 0.02% of the image area is perturbed. The results highlight architecture-specific vulnerabilities and suggest a trade-off between robustness and zero-shot flexibility, which informs the secure design of multimodal systems and motivates future work on broader model families and attack hyperparameter studies.
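The Differential Evolution search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The agent encoding of (row, col, r, g, b) per perturbed pixel, the DE/rand/1 mutation without crossover, the hyperparameter values, the [0, 1] pixel range, and the toy score function standing in for a black-box model are all assumptions made for illustration.

```python
import numpy as np

def apply_sparse(image, agent):
    """Decode an agent of k (row, col, r, g, b) tuples and overwrite k pixels.

    Assumes `image` is a float array of shape (H, W, 3) with values in [0, 1].
    """
    out = image.copy()
    h, w, _ = image.shape
    for row, col, r, g, b in agent.reshape(-1, 5):
        y = int(np.clip(row, 0, h - 1))
        x = int(np.clip(col, 0, w - 1))
        out[y, x] = np.clip([r, g, b], 0.0, 1.0)
    return out

def de_sparse_attack(image, score_fn, k=4, pop=20, iters=30, f=0.5, seed=0):
    """Minimise score_fn(perturbed image) with a basic DE/rand/1 loop.

    `score_fn` is a hypothetical black-box query, e.g. the model's
    confidence in the true label; only its output value is used.
    """
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    # Upper bounds for every gene: k repetitions of (row, col, r, g, b).
    high = np.tile([h - 1, w - 1, 1.0, 1.0, 1.0], k)
    agents = rng.uniform(0.0, high, size=(pop, 5 * k))
    fitness = np.array([score_fn(apply_sparse(image, a)) for a in agents])
    for _ in range(iters):
        for i in range(pop):
            others = [j for j in range(pop) if j != i]
            a, b, c = agents[rng.choice(others, size=3, replace=False)]
            trial = np.clip(a + f * (b - c), 0.0, high)
            trial_fit = score_fn(apply_sparse(image, trial))
            # Greedy selection: keep the trial only if it is not worse.
            if trial_fit <= fitness[i]:
                agents[i], fitness[i] = trial, trial_fit
    best = int(np.argmin(fitness))
    return apply_sparse(image, agents[best]), float(fitness[best])

# Toy usage: mean brightness stands in for the black-box model score.
image = np.ones((8, 8, 3))
adv, score = de_sparse_attack(image, lambda img: float(img.mean()))
```

Because each agent encodes exactly k pixels, the perturbation changes at most k pixel locations by construction, which is how this sketch enforces the $L_{0}$ budget. The search only queries the score function and never uses gradients, matching the black-box setting.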

Abstract

Assessing the robustness of multimodal models against adversarial examples is important for the safety of their users. We craft L0-norm perturbation attacks on the preprocessed input images. We launch them in a black-box setup against four multimodal models and two unimodal DNNs, considering both targeted and untargeted misclassification. Our attacks perturb less than 0.04% of the image area and use different spatial arrangements of the perturbed pixels: sparse positioning and pixels arranged in different contiguous shapes (row, column, diagonal, and patch). To the best of our knowledge, we are the first to assess the robustness of three state-of-the-art multimodal models (ALIGN, AltCLIP, GroupViT) against different sparse and contiguous pixel distribution perturbations. The obtained results indicate that unimodal DNNs are more robust than multimodal models. Furthermore, models using a CNN-based image encoder are more vulnerable than models using a ViT: for untargeted attacks, we obtain a 99% success rate by perturbing less than 0.02% of the image area.
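The contiguous shapes and the area budget mentioned in the abstract can be made concrete with a short Python sketch. The function names, the anchoring of each shape at a top-left pixel, the down-right direction of the diagonal, the restriction of patches to square pixel counts, and the 224 x 224 input resolution are assumptions for illustration; the abstract does not specify them.

```python
import math

def contiguous_pixels(shape, row, col, k):
    """Return k (row, col) positions for one contiguous shape.

    The shape is anchored at (row, col); bounds checking against the
    image size is left to the caller.
    """
    if shape == "row":
        return [(row, col + i) for i in range(k)]
    if shape == "column":
        return [(row + i, col) for i in range(k)]
    if shape == "diagonal":
        # Assumed direction: down and to the right.
        return [(row + i, col + i) for i in range(k)]
    if shape == "patch":
        side = math.isqrt(k)
        if side * side != k:
            raise ValueError("patch needs a square number of pixels")
        return [(row + i, col + j) for i in range(side) for j in range(side)]
    raise ValueError(f"unknown shape: {shape}")

def area_fraction(k, height, width):
    """Fraction of the image area covered by k perturbed pixels."""
    return k / (height * width)

# Four perturbed pixels arranged as a 2 x 2 patch anchored at (10, 20).
patch = contiguous_pixels("patch", 10, 20, 4)

# Share of an assumed 224 x 224 input covered by 16 perturbed pixels.
fraction = area_fraction(16, 224, 224)
```

Under the assumed 224 x 224 resolution, 16 pixels cover 16 / 50176 of the image, or about 0.032% of the area, which falls within the 0.04% bound stated in the abstract.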

Paper Structure

This paper contains 18 sections, 4 equations, 6 figures, 2 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Threat Model Visualization
  • Figure 2: Agent encoding for the Sparse Attack
  • Figure 3: Spatial arrangement of pixels in Contiguous Attacks for exemplary case of four perturbed pixels
  • Figure 4: Pixel perturbations for different attacks on patch-based models (e.g. Vision Transformers, ViTs)
  • Figure 5: Targeted Attacks
  • ...and 1 more figure