A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen

TL;DR

The paper introduces M-Attack, a simple yet potent transfer-based attack on strong black-box LVLMs that embeds semantic details within local regions of the perturbation through random cropping and embedding-space alignment. It combines local-to-global or local-to-local matching with a model ensemble to capture semantics shared across surrogate models, achieving over 90% attack success against commercial LVLMs such as GPT-4.5, GPT-4o, and o1. A new KMRScore metric complements the attack success rate (ASR) in quantifying semantic transfer, and extensive ablations show that both local semantics and the ensemble are necessary. The work highlights significant security vulnerabilities in current commercial multimodal systems and motivates stronger defenses and standardized evaluation. In practice, the approach sets a new baseline for evaluating the robustness of black-box LVLMs and informs the design of defenses against semantically guided perturbations.

Abstract

Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against closed-source commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial black-box LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we propose to refine semantic clarity by encoding explicit semantic details within local regions, thus ensuring the capture of finer-grained features and inter-model transferability, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective baseline: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. While the naive source-target matching method has been utilized before in the literature, we are the first to provide a tight analysis, which establishes a close connection between perturbation optimization and semantics. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5/3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods with lower $\ell_1/\ell_2$ perturbations.
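The per-step procedure described in the abstract (random crop, resize, align with the target in embedding space) can be sketched as follows. This is a minimal toy illustration under stated assumptions, not the authors' implementation: the linear projection `proj` stands in for a real surrogate image encoder, and the function names, the crop ranges, the sign-gradient update, and the `eps`/`alpha` values are all hypothetical choices made for the example. Only the adversarial image is cropped here, and it is matched against the whole target image.

```python
import math
import random

def sample_crop_box(height, width, scale=(0.5, 1.0), ratio=(0.75, 4 / 3), rng=random):
    """Sample (top, left, h, w): area fraction drawn from `scale`, w/h from `ratio`."""
    for _ in range(10):
        area = height * width * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(area * aspect)))
        h = int(round(math.sqrt(area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            return rng.randint(0, height - h), rng.randint(0, width - w), h, w
    return 0, 0, height, width  # fall back to the full image

def resize_indices(box, out):
    """Nearest-neighbour source pixel (row, col) for each cell of an out x out resize."""
    top, left, h, w = box
    return [(top + i * h // out, left + j * w // out) for i in range(out) for j in range(out)]

def embed(pixels, proj):
    """Toy linear 'encoder' (assumption): stands in for a surrogate image encoder."""
    return [sum(p * x for p, x in zip(row, pixels)) for row in proj]

def cosine_and_grad(e, t):
    """Cosine similarity of e and t, and its gradient with respect to e."""
    ne = math.sqrt(sum(v * v for v in e)) + 1e-12
    nt = math.sqrt(sum(v * v for v in t)) + 1e-12
    cos = sum(a * b for a, b in zip(e, t)) / (ne * nt)
    return cos, [b / (ne * nt) - cos * a / (ne * ne) for a, b in zip(e, t)]

def craft(source, target, proj, out=4, steps=50, eps=16 / 255, alpha=1 / 255, seed=0):
    """Return a perturbation delta with |delta| <= eps (eps and alpha are assumed values)."""
    rng = random.Random(seed)
    H, W = len(source), len(source[0])
    delta = [[0.0] * W for _ in range(H)]
    # Target side: embed the whole target image after resizing.
    t_emb = embed([target[r][c] for r, c in resize_indices((0, 0, H, W), out)], proj)
    for _ in range(steps):
        # 1) Random crop of the current adversarial image, resized to out x out.
        cells = resize_indices(sample_crop_box(H, W, rng=rng), out)
        z = [source[r][c] + delta[r][c] for r, c in cells]
        # 2) Gradient of the embedding similarity with respect to the resized crop.
        _, g_e = cosine_and_grad(embed(z, proj), t_emb)
        g_z = [sum(proj[k][i] * g_e[k] for k in range(len(proj))) for i in range(len(z))]
        # 3) Accumulate gradients per source pixel (a pixel may be sampled twice).
        grad = {}
        for cell, g in zip(cells, g_z):
            grad[cell] = grad.get(cell, 0.0) + g
        # 4) Sign-gradient ascent on pixels inside the crop, then project back
        #    onto the eps-ball and the valid pixel range [0, 1].
        for (r, c), g in grad.items():
            d = delta[r][c] + alpha * ((g > 0) - (g < 0))
            d = max(-eps, min(eps, d))
            delta[r][c] = min(1.0, max(0.0, source[r][c] + d)) - source[r][c]
    return delta

# Toy usage on random 8x8 "images" with a random 6-dimensional projection.
rng = random.Random(1)
src = [[rng.random() for _ in range(8)] for _ in range(8)]
tgt = [[rng.random() for _ in range(8)] for _ in range(8)]
proj = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(6)]
delta = craft(src, tgt, proj)
print(max(abs(v) for row in delta for v in row))  # largest perturbation magnitude
```

Because each step only touches pixels inside the sampled crop, updates accumulate unevenly across the image over many steps; in a realistic setting the toy projection would be replaced by one or more pretrained image encoders with automatic differentiation.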

Paper Structure

This paper contains 30 sections, 1 theorem, 10 equations, 17 figures, 16 tables, 3 algorithms.

Key Result

Proposition B.1

Let $\mu_S^G$ and $\mu_T^G$ denote the global distributions of the source image $\hat{\mathbf{x}}^s + \delta$ and target image $\hat{\mathbf{x}}^t$, respectively, obtained by representing each image as a single feature vector. Let $\mu_S^L$ and $\mu_T^L$ denote the corresponding local distributions, where the optimal transport (OT) distance is defined by Here, $f$ is a feature extractor, ...

Figures (17)

  • Figure 1: Example responses from closed-source LVLMs to targeted attacks generated by our method.
  • Figure 2: Illustration of our proposed framework. Our method is based on two components: Local-to-Global or Local-to-Local Matching (LM) and Model Ensemble (ENS). LM is the core of our approach and helps refine the local semantics of the perturbation. ENS avoids over-reliance on a single model's embedding similarity, thus improving attack transferability.
  • Figure 3: Empirical cumulative distribution vs. uniform distribution on 20 randomly-sampled failed adversarial images. Shading shows standard deviation.
  • Figure 4: Comparison of global similarity and ASR across different matching schemes: Global to Global, Local to Global and Local to Local.
  • Figure 5: Simulated heatmap visualization of perturbation aggregation across various steps using different crop schemes. The scales control the range of proportions to the original image area.
  • ...and 12 more figures
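
The comparison in Figure 3 (empirical cumulative distribution of perturbation values against a uniform distribution) can be approximated with a simple statistic. The sketch below is an illustration under assumptions, not the authors' analysis: the function name `max_gap_to_uniform` is hypothetical, and the Kolmogorov-Smirnov-style maximum gap is one possible way to summarize how close a perturbation is to Uniform(low, high).

```python
import random

def max_gap_to_uniform(values, low, high):
    """Largest gap between the empirical CDF of `values` and the Uniform(low, high) CDF."""
    xs = sorted(values)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        u = min(1.0, max(0.0, (x - low) / (high - low)))  # uniform CDF at x
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs((i + 1) / n - u), abs(i / n - u))
    return worst

# Toy usage: a uniform-looking perturbation vs. one saturated at the budget.
eps = 16 / 255
rng = random.Random(0)
uniform_like = [rng.uniform(-eps, eps) for _ in range(5000)]
saturated = [rng.choice((-eps, eps)) for _ in range(5000)]
print(max_gap_to_uniform(uniform_like, -eps, eps))
print(max_gap_to_uniform(saturated, -eps, eps))
```

A small gap means the perturbation values are close to uniformly distributed over the budget, which is the pattern the abstract associates with failed attacks.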

Theorems & Definitions (2)

  • Proposition B.1: Local-to-Local Transport Yields Lower Alignment Cost
  • Proof: Proof Sketch