Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models

Qingyuan Zeng; Zhenzhong Wang; Yiu-ming Cheung; Min Jiang

TL;DR

The paper tackles targeted adversarial attacks on image-to-text models under a strict decision-based black-box setting, in which only the final output text is observable. It introduces the Ask, Attend, Attack (AAA) framework, which first builds a target semantic dictionary (Ask), then localizes influential image regions with a Grad-CAM-inspired attention map from a surrogate model (Attend), and finally searches the reduced space for imperceptible perturbations using differential evolution (Attack) to drive the model toward the specified target text. AAA combines a WordNet-based semantic metric, CLIP-based text similarity, and an attention-guided search to avoid the semantic loss that plagues gray-box methods. In extensive experiments on Flickr30k, it demonstrates superior targeted attack performance on both the Transformer-based ViT-GPT2 and the CNN+RNN-based Show-Attend-Tell models. The results highlight notable vulnerabilities in contemporary vision-language systems under realistic black-box constraints and underscore the need for robust defenses and further study of semantically consistent adversarial strategies. Overall, the framework advances practical black-box targeted attacks and offers insight into attention-based search and semantic preservation in image-to-text threats.

Abstract

While image-to-text models have demonstrated significant advancements in various vision-language tasks, they remain susceptible to adversarial attacks. Existing white-box attacks on image-to-text models require access to the architecture, gradients, and parameters of the target model, which limits their practicality. Although the recently proposed gray-box attacks are more practical, they suffer from semantic loss during the training process, which limits their targeted attack performance. To advance adversarial attacks on image-to-text models, this paper focuses on a challenging scenario: decision-based black-box targeted attacks, where the attackers have access only to the final output text and aim to perform targeted attacks. Specifically, we formulate the decision-based black-box targeted attack as a large-scale optimization problem. To solve this optimization problem efficiently, a three-stage process, Ask, Attend, Attack (AAA), is proposed to coordinate with the solver. Ask guides attackers to create target texts that satisfy the specific semantics. Attend identifies the crucial regions of the image for attacking, thus reducing the search space for the subsequent Attack. Attack uses an evolutionary algorithm to attack the crucial regions, where the attacks are semantically related to the target texts of Ask, thus achieving targeted attacks without semantic loss. Experimental results on transformer-based and CNN+RNN-based image-to-text models confirm the effectiveness of the proposed AAA.
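To make the Attend and Attack stages concrete, the following is a minimal sketch of a decision-based search, not the authors' implementation. It runs a textbook differential evolution (DE/rand/1/bin) over a perturbation confined to a small set of masked pixels and queries the model only for its output text. Everything named here is an assumption for illustration: `toy_caption_model` stands in for the black-box image-to-text model, `word_distance` (a Jaccard distance over words, lower is better) stands in for a CLIP-based text similarity, and `mask_idx` is hard-coded where the paper would derive the region from a surrogate model's attention map.

```python
import random

def word_distance(caption, target):
    """Toy fitness (assumption): Jaccard distance between word sets, lower is better."""
    a, b = set(caption.split()), set(target.split())
    return 1.0 - len(a & b) / len(a | b)

def apply_delta(image, mask_idx, delta):
    """Perturb only the masked pixels and keep pixel values in [0, 255]."""
    adv = list(image)
    for j, d in zip(mask_idx, delta):
        adv[j] = min(255.0, max(0.0, image[j] + d))
    return adv

def de_attack(caption_fn, image, mask_idx, target, eps,
              pop_size=20, generations=60, F=0.5, CR=0.9, seed=0):
    """Differential evolution over the masked pixels, using only output text."""
    rng = random.Random(seed)
    dim = len(mask_idx)

    def fitness(delta):
        return word_distance(caption_fn(apply_delta(image, mask_idx, delta)), target)

    # The first individual is the zero perturbation, so the returned result
    # is never worse than the clean image under greedy selection.
    pop = [[0.0] * dim]
    pop += [[rng.uniform(-eps, eps) for _ in range(dim)] for _ in range(pop_size - 1)]
    fit = [fitness(ind) for ind in pop]

    for _ in range(generations):
        if min(fit) == 0.0:  # output text already matches the target
            break
        for i in range(pop_size):
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    v = min(eps, max(-eps, v))  # enforce the perturbation budget
                else:
                    v = pop[i][j]
                trial.append(v)
            f_trial = fitness(trial)
            if f_trial <= fit[i]:  # greedy selection
                pop[i], fit[i] = trial, f_trial

    best = min(range(pop_size), key=fit.__getitem__)
    return apply_delta(image, mask_idx, pop[best]), fit[best]

def toy_caption_model(image):
    """Hypothetical black-box model: caption depends on the mean of pixels 0..3."""
    m = sum(image[:4]) / 4.0
    if m < 100:
        return "a cat on a sofa"
    if m < 105:
        return "a cat on grass"
    return "a dog on grass"

clean = [95.0] * 16        # flattened toy "image"
mask_idx = [0, 1, 2, 3]    # assumed output of the Attend stage
target = "a dog on grass"  # assumed output of the Ask stage
adv, score = de_attack(toy_caption_model, clean, mask_idx, target, eps=25.0)
print(toy_caption_model(adv), score)
```

Restricting the search to `mask_idx` is what shrinks the problem: the population evolves 4 values here instead of all 16 pixels, which mirrors the role the abstract assigns to Attend. A real attack would replace the toy model and fitness with the target model's captions and a learned text-similarity score.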

Paper Structure (22 sections, 13 equations, 6 figures, 1 table)

Introduction
Related work
White-box Attack
Gray-box Attack
Methodology
Problem Formulation
Overview
Ask Stage
Attend Stage
Attack Stage
Evaluation and Results
Experiment setups
Model and dataset
Evaluation metrics
Experiment results
...and 7 more sections

Figures (6)

Figure 1: The semantic loss problem in existing gray-box targeted attack methods.
Figure 2: Diagram of our decision-based black-box targeted attack method Ask, Attend, Attack.
Figure 3: Panels (a) and (b) compare the convergence curves of populations with and without Attend under the same perturbation size $\epsilon$. The fitness function is $S_{clip}$ in Formula \ref{eqa:CLIP}, where lower values mean stronger attacks. The dashed line is the average fitness value, and the solid line is the best fitness value. The green line is AAA and the red line is AAA w/o Attend. (c) shows the attention heatmap. (d) and (e) show the visual effect of the adversarial image with and without Attend, at the minimal perturbation that achieves a 100% attack success rate.
Figure 4: Grad-CAM attention heatmaps of different surrogate models for the same target text "a woman is holding a pair of shoes". M is METEOR, B is BLEU, C is CLIP, and S is SPICE.
Figure 5: Performance of adversarial image attacks varies with perturbation size $\epsilon$. The $\epsilon$ of (a) and (f) is 25, $\epsilon$ of (b) and (g) is 15, $\epsilon$ of (c) and (h) is 10, $\epsilon$ of (d) and (i) is 5. (e) is our attention heatmap of the target text on the image. (j) is the target image generated based on the target text used in existing works. M is METEOR score, B is BLEU score, and C is CLIP score.
...and 1 more figure
