Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Xiaohan Zhao; Zhaoyi Li; Yaxin Luo; Jiacheng Cui; Zhiqiang Shen

TL;DR

M-Attack-V2 is a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs, raising success rates on Claude-4.0 from 8% to 30%, on Gemini-2.5-Pro from 83% to 97%, and on GPT-5 from 98% to 100%, and outperforming prior black-box LVLM attacks.

Abstract

Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
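The interplay of the modules named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the linear map `W` stands in for a surrogate image encoder, crops have a fixed size and are not resized, the auxiliary targets are plain embedding vectors, and the names `cos_grad`, `mca_gradient`, and `attack` as well as all hyperparameter defaults are assumptions made for illustration. The sketch averages the similarity gradient over `K` random crops (in the spirit of MCA) and over a small target set (in the spirit of ATA), then applies a momentum-smoothed sign step inside an $\ell_\infty$ ball.

```python
import numpy as np

def cos_grad(z, t):
    """Return the cosine similarity of z and t and its gradient w.r.t. z."""
    nz, nt = np.linalg.norm(z), np.linalg.norm(t)
    cos = float(z @ t / (nz * nt))
    return cos, t / (nz * nt) - cos * z / nz**2

def mca_gradient(x, delta, W, targets, crop, K, rng):
    """Average the similarity gradient over K random crops of the perturbed
    source and over a small set of target embeddings."""
    n = x.shape[0]
    grad = np.zeros_like(x)
    for _ in range(K):
        i, j = rng.integers(0, n - crop + 1, size=2)       # random crop offset
        v = (x + delta)[i:i + crop, j:j + crop].ravel()
        z = W @ v                                           # toy linear encoder (assumption)
        gz = np.mean([cos_grad(z, t)[1] for t in targets], axis=0)
        # Scatter the crop gradient back to its location in the full image.
        grad[i:i + crop, j:j + crop] += (W.T @ gz).reshape(crop, crop)
    return grad / K

def attack(x, W, targets, eps=8 / 255, alpha=1 / 255, beta=0.9,
           steps=50, crop=8, K=4, seed=0):
    """Momentum sign ascent on the averaged similarity, with |delta| <= eps."""
    rng = np.random.default_rng(seed)
    delta = np.zeros_like(x)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = mca_gradient(x, delta, W, targets, crop, K, rng)
        m = beta * m + (1 - beta) * g                       # accumulates past crop gradients
        delta = np.clip(delta + alpha * np.sign(m), -eps, eps)
    return delta
```

For example, with `x` a 16x16 array in [0, 1], `W` of shape `(12, 64)`, and `targets` of shape `(3, 12)`, `attack(x, W, targets)` returns a perturbation of the same shape as `x`. Larger `K` trades compute for a less noisy update direction.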

Paper Structure (38 sections, 2 theorems, 18 equations, 13 figures, 20 tables)

This paper contains 38 sections, 2 theorems, 18 equations, 13 figures, 20 tables.

Introduction
Related Work
Approach
Preliminaries and Limitations in Prior Local-level Matching-Based Methods
Asymmetric Matching over Expectation
Gradient Denoising via Multi-Crop Alignment
Improved Sampling Quality via Auxiliary Target Alignment
Patch Momentum with Built-in Replay Effect
Experiments
Experimental Setup
Selection of Surrogate Model
Evaluation Across LVLMs and Settings
Ablation Study
Conclusion
Additional Details for Theoretical Analysis
...and 23 more sections

Key Result

Theorem 3.1

Let $g_k=\nabla_{\mathbf{X}_\text{sou}} \mathcal{L}(f(\mathcal{T}_k(\mathbf{X}_\text{sou})), y)$ denote the gradient from $\mathcal{T}_k$, let $\mu=\mathbb{E}[g_k]$ and $\sigma^2=\mathbb{E}[\lVert g_k-\mu \rVert_2^2]$ denote its mean and variance, and let $p_{k\ell}$ denote the pairwise correlation $p_{k\ell} = \dots$, where $\overline{p} = \mathbb{E}[p_{k\ell}],\ k\neq \ell$, is the expected pairwise correlation.
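The quantities defined in Theorem 3.1 are those of the classical variance decomposition for an average of correlated estimators. As a hedged illustration only, the sketch below assumes the normalization $p_{k\ell} = \mathbb{E}[\langle g_k-\mu,\ g_\ell-\mu\rangle]/\sigma^2$ and an equicorrelated Gaussian noise model; neither is taken from the paper, whose exact statement is not reproduced here. Under these assumptions, the average of $K$ gradients has variance $\sigma^2/K + \frac{K-1}{K}\,\overline{p}\,\sigma^2$, which the simulation checks numerically. The function names are hypothetical.

```python
import numpy as np

def predicted_variance(sigma2, K, p_bar):
    """Variance of the mean of K estimators that each have total variance
    sigma2 and mean pairwise correlation p_bar."""
    return sigma2 / K + (K - 1) / K * p_bar * sigma2

def simulate_variance(d, K, p_bar, trials, seed=0):
    """Monte Carlo estimate of E||mean_k g_k - mu||^2 for d-dimensional
    gradients with unit per-coordinate variance (so sigma2 = d)."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((trials, 1, d))    # noise common to all K crops
    private = rng.standard_normal((trials, K, d))   # noise specific to each crop
    noise = np.sqrt(p_bar) * shared + np.sqrt(1 - p_bar) * private
    mean_noise = noise.mean(axis=1)                 # deviation of the averaged gradient
    return float((mean_noise ** 2).sum(axis=1).mean())
```

The formula makes the limit of averaging explicit: as $K$ grows, the variance approaches $\overline{p}\,\sigma^2$ rather than zero, so the benefit of more crops depends on how weakly correlated their gradients are.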

Figures (13)

Figure 1: Improvement of $\mathtt{M\text{-}Attack\text{-}V2}$ over $\mathtt{M\text{-}Attack}$ on strong and up-to-date commercial black-box models (Claude 4, Gemini 2.5 and GPT-5). ASR and KMR stand for attack success rate and keyword matching rate, respectively.
Figure 2: Cosine similarity of gradients from random crops. (a) Similarity vs. IoU between two crops in a fixed iteration, showing rapid decay and values $< 0.1$ once IoU $< 0.8$. (b) Similarity between consecutive source gradients. The gradient similarity of $\mathtt{M\text{-}Attack}$ (V1) is almost zero, while $\mathtt{M\text{-}Attack\text{-}V2}$ raises it to around 0.2. Results are averaged over 200 runs.
Figure 3: Comparison of (a) optimization trajectories for different values of $K$ and (b) gradient patterns of single-crop alignment versus Multi-Crop Alignment (MCA). The gradient pattern of ResNet-50 remains consistent when crops share a large pixel overlap, while the gradient pattern of ViTs changes dramatically. MCA helps smooth out this effect.
Figure 4: Visualization of adversarial samples under $\epsilon=8$.
Figure 5: $\mathtt{M\text{-}Attack\text{-}V2}$
...and 8 more figures
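The two quantities plotted in Figure 2, the IoU between two crops and the cosine similarity between their gradients, can be computed as in the following minimal sketch. The `(top, left, height, width)` box convention and the function names are assumptions chosen for illustration, not taken from the paper's code.

```python
import numpy as np

def crop_iou(a, b):
    """Intersection over union of two axis-aligned crops,
    each given as (top, left, height, width)."""
    top = max(a[0], b[0])
    left = max(a[1], b[1])
    bottom = min(a[0] + a[2], b[0] + b[2])
    right = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, bottom - top) * max(0, right - left)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

def cosine_similarity(g1, g2):
    """Cosine similarity between two gradients of the same shape."""
    g1 = np.asarray(g1, dtype=float).ravel()
    g2 = np.asarray(g2, dtype=float).ravel()
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))
```

Two 10x10 crops offset horizontally by 5 pixels, for instance, overlap in 50 pixels out of a union of 150, giving an IoU of 1/3.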

Theorems & Definitions (2)

Theorem 3.1
Theorem 3.5
