Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen

TL;DR

M-Attack-V2 is a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks.

Abstract

Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
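The variance-reduction idea behind MCA can be illustrated with a toy simulation. The sketch below is not the authors' implementation: it assumes each crop's gradient is a shared "transferable" signal plus independent Gaussian noise, and all names (`crop_gradient`, `mca_gradient`, `mean_consecutive_similarity`) and parameter values are hypothetical. It only shows that averaging K such gradients per iteration makes consecutive update directions more aligned.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def crop_gradient(signal, noise_scale, rng):
    """Toy stand-in for one crop's gradient: shared signal plus independent noise."""
    return [s + rng.gauss(0.0, noise_scale) for s in signal]


def mca_gradient(signal, k, noise_scale, rng):
    """Average the gradients of k independently sampled crops (the MCA idea)."""
    grads = [crop_gradient(signal, noise_scale, rng) for _ in range(k)]
    return [sum(col) / k for col in zip(*grads)]


def mean_consecutive_similarity(k, steps=200, dim=256, noise_scale=5.0, seed=0):
    """Mean cosine similarity between gradients of consecutive iterations."""
    rng = random.Random(seed)
    # Assumed ground-truth direction shared by every crop (hypothetical).
    signal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    prev = mca_gradient(signal, k, noise_scale, rng)
    total = 0.0
    for _ in range(steps):
        cur = mca_gradient(signal, k, noise_scale, rng)
        total += cosine(prev, cur)
        prev = cur
    return total / steps


single = mean_consecutive_similarity(k=1)   # single-crop matching
multi = mean_consecutive_similarity(k=10)   # multi-crop averaging
print(f"K=1: {single:.3f}, K=10: {multi:.3f}")
```

In this toy model the noise variance of the averaged gradient shrinks by a factor of K, so the shared signal dominates more of each update. Real crop gradients are correlated rather than independent, which is why the gain in practice depends on the pair-wise correlation between crops.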

Paper Structure (38 sections, 2 theorems, 18 equations, 13 figures, 20 tables)

This paper contains 38 sections, 2 theorems, 18 equations, 13 figures, 20 tables.

Key Result

Theorem 3.1

Let $g_k=\nabla_{\mathbf{X}_\text{sou}} \mathcal{L}(f(\mathcal{T}_k(\mathbf{X}_\text{sou})), y)$ denote the gradient from $\mathcal{T}_k$, let $\mu=\mathbb{E}[g_k]$ and $\sigma^2=\mathbb{E}[\lVert g_k-\mu \rVert_2^2]$ denote its mean and variance, and let $p_{k\ell}$ denote the pair-wise correlation between $g_k$ and $g_\ell$ [...], where $\overline{p} = \mathbb{E}[p_{k\ell}]$, $k\neq \ell$, is the expected pair-wise correlation.
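As a worked illustration of how these quantities interact, the variance of a $K$-crop average can be derived under two assumptions that are not confirmed by the text shown here: that every $g_k$ shares the same mean $\mu$ and variance $\sigma^2$, and that the correlation is normalised as $p_{k\ell} = \mathbb{E}[\langle g_k-\mu,\, g_\ell-\mu\rangle]/\sigma^2$. Expanding the squared norm of the averaged deviation gives:

$$
\begin{aligned}
\mathbb{E}\Big[\Big\lVert \frac{1}{K}\sum_{k=1}^{K} g_k - \mu \Big\rVert_2^2\Big]
&= \frac{1}{K^2}\sum_{k=1}^{K}\mathbb{E}\big[\lVert g_k-\mu\rVert_2^2\big]
 + \frac{1}{K^2}\sum_{k\neq \ell}\mathbb{E}\big[\langle g_k-\mu,\, g_\ell-\mu\rangle\big] \\
&= \frac{\sigma^2}{K} + \frac{K(K-1)}{K^2}\,\overline{p}\,\sigma^2
 = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\overline{p}\,\sigma^2 .
\end{aligned}
$$

Under these assumptions, averaging helps most when $\overline{p}$ is small: for nearly uncorrelated crop gradients the variance falls roughly as $\sigma^2/K$, whereas for $\overline{p}\to 1$ it stays near $\sigma^2$ regardless of $K$.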

Figures (13)

  • Figure 1: Improvement of $\mathtt{M\text{-}Attack\text{-}V2}$ over $\mathtt{M\text{-}Attack}$ on strong and up-to-date commercial black-box models (Claude 4, Gemini 2.5 and GPT-5). ASR and KMR stand for attack success rate and keyword matching rate, respectively.
  • Figure 2: Cosine similarity of gradients from random crops. (a) Similarity vs. IoU between two crops in a fixed iteration, showing rapid decay and values $< 0.1$ once IoU $< 0.8$. (b) Similarity between consecutive source gradients. V1's gradient similarity is almost zero, while V2 improves it to around 0.2. Results are averaged over 200 runs.
  • Figure 3: Comparison of (a) optimization trajectories for different values of $K$ and (b) the gradient pattern of single-crop alignment versus Multi-Crop Alignment (MCA). The gradient pattern of ResNet-50 remains consistent when two crops overlap on most of their pixels, while the gradient pattern of ViTs changes dramatically. MCA helps to smooth out this effect.
  • Figure 4: Visualization of adversarial samples under $\epsilon=8$.
  • Figure 5: $\mathtt{M\text{-}Attack\text{-}V2}$
  • ...and 8 more figures
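To give a sense of scale for the IoU threshold mentioned in the Figure 2 caption, the following sketch computes the IoU of two axis-aligned crops and the horizontal shift at which two equal square crops reach a given IoU. The helper names and the 224-pixel crop size are assumptions made for illustration; they are not taken from the paper.

```python
def crop_iou(box_a, box_b):
    """IoU of two axis-aligned crops given as (left, top, right, bottom)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def shift_for_iou(size, target_iou):
    """Horizontal shift (pixels) at which two equal square crops reach target_iou."""
    # For side s and shift d: IoU = (s - d) / (s + d), so d = s * (1 - IoU) / (1 + IoU).
    return size * (1.0 - target_iou) / (1.0 + target_iou)


side = 224  # assumed crop size, for illustration only
d = shift_for_iou(side, 0.8)
print(f"shift for IoU 0.8: {d:.1f} px")
print(f"check: {crop_iou((0, 0, side, side), (d, 0, side + d, side)):.3f}")
```

For a 224-pixel square crop, a purely horizontal shift of roughly 25 pixels already brings the IoU down to 0.8, so two independently sampled crops can easily fall in the regime where the caption reports near-zero gradient similarity.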

Theorems & Definitions (2)

  • Theorem 3.1
  • Theorem 3.5