Black-box Attacks on Image Activity Prediction and its Natural Language Explanations
Alina Elena Baia, Valentina Poggioni, Andrea Cavallaro
TL;DR
This work studies the vulnerability of a self-rationalizing multimodal activity recognition system to black-box, unrestricted attacks on its natural language explanations. The problem is formalized for an image $I \in \mathbb{R}^{h \times w \times 3}$ and a model $M_E$ with output $M_E(I)=(a,e,I_e)$. The authors define two attack modes, $S1$ and $S2$, and implement two perturbation strategies, ColorFoolX and region-aware image editing, both of which operate without surrogate models. The attack framework optimizes explanation similarity $Q_{\hat{T}}(I,\hat{I})$ and image similarity $Q_{\hat{I}}(I,\hat{I})$ using a hierarchical GA/ES setup, with NSGA-II handling the multi-objective trade-off, and is evaluated on the ACT-X dataset with NLX-GPT. Results show substantial attack success (up to approximately $77.5\%$ in $S2$) while maintaining near-original image quality, demonstrating that access to the final outputs alone is enough to induce unfaithful explanations. The findings highlight the need for evaluation metrics and defenses for faithfulness in vision-language explanations and motivate future improvements to attention mechanisms.
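The trade-off between the two similarity scores can be illustrated with a minimal sketch of non-dominated (Pareto) filtering, the ranking idea that NSGA-II builds on. This is not the authors' implementation. It assumes each candidate adversarial image has already been scored, that a lower explanation similarity and a higher image similarity are both preferred, and the names `dominates` and `pareto_front` and the scores are hypothetical.

```python
from typing import List, Tuple

# Hypothetical candidate score: (explanation_similarity, image_similarity).
# Assumption: the attacker prefers LOWER explanation similarity and
# HIGHER image similarity; the actual objectives may be defined differently.
Score = Tuple[float, float]


def dominates(a: Score, b: Score) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Score]) -> List[Score]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


if __name__ == "__main__":
    # Illustrative scores only, not results from the paper.
    scores = [(0.90, 0.99), (0.40, 0.95), (0.35, 0.80), (0.60, 0.90), (0.40, 0.85)]
    for score in pareto_front(scores):
        print(score)
```

Each surviving candidate represents a different compromise between changing the explanation and preserving the image. A full NSGA-II loop would add repeated front ranking, crowding-distance selection, and variation operators on top of this filter.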
Abstract
Explainable AI (XAI) methods aim to describe the decision process of deep neural networks. Early XAI methods produced visual explanations, whereas more recent techniques generate multimodal explanations that include textual information and visual representations. Visual XAI methods have been shown to be vulnerable to white-box and gray-box adversarial attacks, in which the attacker has full or partial knowledge of, and access to, the target system. Because the vulnerabilities of multimodal XAI models have not been examined, in this paper we assess, for the first time, the robustness to black-box attacks of the natural language explanations generated by a self-rationalizing image-based activity recognition model. We generate unrestricted, spatially variant perturbations that disrupt the association between the predictions and the corresponding explanations, misleading the model into generating unfaithful explanations. We show that we can create adversarial images that manipulate the explanations of an activity recognition model with access only to its final output.
