When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

Chengyin Hu, Xuemeng Sun, Jiajun Han, Qike Zhang, Xiang Chen, Xin Wang, Yiwei Wei, Jiahua Long

Abstract

Vision-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces, remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement-field distortion with surface-consistent appearance variations. To balance visual naturalness against adversarial effectiveness, we design a hierarchical fitness function over a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach with a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and then assessed for transferability to generative tasks. Experimental results show that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines on both image captioning and visual question answering.
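To make the abstract's pipeline concrete, the following is a minimal sketch of a multi-scale wrinkle displacement field applied to an image by coordinate remapping. This is an illustrative stand-in under assumed parameters (the `scales` frequency/amplitude pairs, random ridge orientations, nearest-neighbour warping), not the authors' implementation; the paper's method additionally couples the geometric warp with surface-consistent appearance (shading) changes and searches the low-dimensional parameters with an optimization strategy.

```python
import numpy as np

def wrinkle_displacement(h, w, scales=((4, 3.0), (12, 1.5), (32, 0.5)), seed=0):
    """Multi-scale wrinkle field: a sum of randomly oriented sinusoidal ridges.

    Each (frequency, amplitude) pair contributes one scale: low-frequency
    folds, medium creases, and fine wrinkles. Illustrative only; the paper
    parameterizes and optimizes this field rather than sampling it randomly.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    dx = np.zeros((h, w))
    dy = np.zeros((h, w))
    for freq, amp in scales:
        theta = rng.uniform(0, np.pi)       # ridge orientation
        phase = rng.uniform(0, 2 * np.pi)   # ridge phase offset
        # sinusoidal ridge pattern running along direction theta
        coord = (xs * np.cos(theta) + ys * np.sin(theta)) / max(h, w)
        s = amp * np.sin(2 * np.pi * freq * coord + phase)
        # displace pixels perpendicular to the ridges
        dx += s * np.cos(theta)
        dy += s * np.sin(theta)
    return dx, dy

def warp(img, dx, dy):
    """Nearest-neighbour remap of `img` by the displacement field (dx, dy)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xi = np.clip(np.rint(xs + dx), 0, w - 1).astype(int)
    yi = np.clip(np.rint(ys + dy), 0, h - 1).astype(int)
    return img[yi, xi]
```

A surface-consistent appearance term could then modulate brightness with the local ridge slope (brighter crests, darker troughs) so the warp reads as physical fabric rather than a digital distortion.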

Paper Structure

This paper contains 13 sections, 19 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 2: Qualitative Results on Image Captioning and Visual Question-Answering. The figure shows representative outputs for clean and perturbed images, illustrating caption deviations and incorrect answers caused by the proposed perturbation.
  • Figure 3: Ablation on Genetic Search Hyperparameters. Attack performance of the proposed wrinkle-like structural perturbation under (a) different numbers of fitness evaluations and (b) different population sizes, showing the trade-off between attack effectiveness and search efficiency across different CLIP models.
  • Figure 4: Ablation on multi-scale wrinkle components. ASR (%) under different component combinations is reported, including large only, medium only, small only, large+medium, and full.
  • Figure 5: Ablation study of the weighting coefficients $\alpha_1$ and $\alpha_2$ on EVA-CLIP ViT-G/14.