Black-box Attacks on Image Activity Prediction and its Natural Language Explanations

Alina Elena Baia; Valentina Poggioni; Andrea Cavallaro

TL;DR

This work studies the vulnerability of a self-rationalizing multimodal activity recognition system to black-box, unrestricted attacks on its natural language explanations. The authors formalize the problem with an input image $I \in \mathbb{R}^{h \times w \times 3}$ and a model output $M_E(I)=(a,e,I_e)$, and define two attack modes: $S1$, which changes the predicted activity while keeping the textual explanation similar, and $S2$, which keeps the predicted activity while changing the textual explanation. Two perturbation strategies, ColorFoolX and region-aware image editing, operate without surrogate models. The attack framework optimizes explanation similarity $Q_{\hat{T}}(I,\hat{I})$ and image similarity $Q_{\hat{I}}(I,\hat{I})$ using a hierarchical GA/ES setup, with NSGA-II handling the multi-objective trade-offs, and is evaluated on the ACT-X dataset with NLX-GPT. Results show substantial attack success (up to approximately $77.5\%$ in $S2$) while maintaining near-original image quality, demonstrating that access to the final outputs alone suffices to induce unfaithful explanations. The findings highlight the need for evaluation metrics and defenses for faithfulness in vision-language explanations and motivate future improvements to attention mechanisms.
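The multi-objective trade-off mentioned above can be illustrated with the Pareto-dominance test that NSGA-II builds on. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, both similarity scores are assumed to lie in $[0,1]$, and the scenario handling assumes that $S1$ aims to keep the explanation similar while $S2$ aims to make it different. The constraint on the predicted activity is not modeled here.

```python
from typing import List, Tuple

# One candidate adversarial image, scored on two objectives that are both maximised:
# (explanation objective, image similarity). Names here are illustrative only.
Candidate = Tuple[float, float]


def objectives(q_text: float, q_image: float, scenario: str) -> Candidate:
    """Turn similarity scores into objectives to maximise (assumes scores in [0, 1])."""
    if scenario == "S1":  # assumed goal: explanation stays similar
        return (q_text, q_image)
    if scenario == "S2":  # assumed goal: explanation becomes different
        return (1.0 - q_text, q_image)
    raise ValueError(f"unknown scenario: {scenario}")


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates (the first NSGA-II front)."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]


if __name__ == "__main__":
    population = [
        objectives(0.9, 0.8, "S1"),
        objectives(0.7, 0.95, "S1"),
        objectives(0.6, 0.7, "S1"),  # worse than the first candidate on both objectives
    ]
    print(pareto_front(population))
```

A full NSGA-II run would additionally rank the remaining fronts and use crowding distance to preserve diversity; the sketch shows only the dominance comparison that decides which candidates represent acceptable trade-offs.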

Abstract

Explainable AI (XAI) methods aim to describe the decision process of deep neural networks. Early XAI methods produced visual explanations, whereas more recent techniques generate multimodal explanations that include textual information and visual representations. Visual XAI methods have been shown to be vulnerable to white-box and gray-box adversarial attacks, with an attacker having full or partial knowledge of and access to the target system. As the vulnerabilities of multimodal XAI models have not been examined, in this paper we assess for the first time the robustness to black-box attacks of the natural language explanations generated by a self-rationalizing image-based activity recognition model. We generate unrestricted, spatially variant perturbations that disrupt the association between the predictions and the corresponding explanations to mislead the model into generating unfaithful explanations. We show that we can create adversarial images that manipulate the explanations of an activity recognition model by having access only to its final output.

Paper Structure (10 sections, 12 equations, 6 figures, 4 tables)

Introduction
Related works
Methodology
Problem definition
Black-box unrestricted attacks
Validation
Experimental setup
Performance evaluation
Results and Discussion
Conclusion

Figures (6)

Figure 1: Sample adversarial images generated against NLX-GPT (Sammani et al., 2022) from a clean image (left) by changing the activity prediction while maintaining the textual explanation (middle) and by maintaining the activity prediction while changing the textual explanation (right).
Figure 2: Example of semantic regions obtained after the first step (middle) and last step (right) of the multi-step segmentation scheme. Regions in brown are considered sensitive to color changes.
Figure 3: Mapping between explanation groups and similarity classes. KEY -- C1: not similar at all, C2: a little similar, C3: somehow similar, C4: very similar, C5: they are the same. Explanation pairs with $Q_{\hat{T}} > 0.85$ (i.e., G1-G3) are rated as highly similar.
Figure 4: Adversarial images generated for a clean image (top left). The visual explanation maps for the activity prediction are shown next to each image. For $S1$ the images have a different activity and the textual explanations are similar. For $S2$ the images have the same activity but different textual explanations. The MANIQA scores for the images are 0.69, 0.63, 0.70, 0.72, 0.64, from top to bottom, respectively.
Figure 5: Distribution of colorfulness scores for $S1$ (top row) and for $S2$ (bottom row). The adversarial examples generated with LC-m and FL-m have colors similar to those of the original images, whereas the colors of the CFX adversarial examples diverge from the distribution of the original images. The higher the score, the more colorful the image.
...and 1 more figure
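Figure 5 compares colorfulness scores but this summary does not state which metric produced them. As an illustration only, the sketch below computes one widely used definition, the Hasler and Süsstrunk colorfulness measure, on a list of RGB pixels; whether the paper uses this exact measure is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev
from typing import Iterable, Tuple

Pixel = Tuple[float, float, float]  # (R, G, B), e.g. values in 0..255


def colorfulness(pixels: Iterable[Pixel]) -> float:
    """Hasler-Suesstrunk colorfulness; assumed here, not confirmed as the paper's metric."""
    pixels = list(pixels)
    # Opponent colour channels: red-green and yellow-blue.
    rg = [r - g for r, g, _ in pixels]
    yb = [0.5 * (r + g) - b for r, g, b in pixels]
    # Combine the spread and the mean offset of the two opponent channels.
    std_term = sqrt(pstdev(rg) ** 2 + pstdev(yb) ** 2)
    mean_term = sqrt(fmean(rg) ** 2 + fmean(yb) ** 2)
    return std_term + 0.3 * mean_term


if __name__ == "__main__":
    gray = [(128, 128, 128)] * 4
    vivid = [(255, 0, 0), (0, 0, 255)]
    print(colorfulness(gray), colorfulness(vivid))
```

Under this definition a uniformly gray image scores zero and saturated, varied colors score higher, which matches the caption's reading that a higher score means a more colorful image.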
