Iterative Adversarial Attack on Image-guided Story Ending Generation

Youze Wang, Wenbo Hu, Richang Hong

TL;DR

IgSEG models fuse image and text to generate endings, but are vulnerable to adversarial perturbations in multimodal settings. The authors introduce Iterative-attack, an iterative method that jointly perturbs text and the ending-related image by identifying important words via a cross-modal loss and applying a PGD-based image perturbation, maximizing the adversarial effect under a BLEU-based criterion. Across four IgSEG models and two datasets, Iterative-attack yields higher attack success rates and lower end-text quality than single-modal or non-iterative baselines, while preserving semantic similarity (Sim $\approx$ 0.95) and maintaining reasonable perplexity. The study extends to multimodal machine translation on Multi30K, showing strong cross-modal perturbation capability and highlighting the need for robust defenses and standardized benchmarks for multimodal text generation.
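The PGD-based image perturbation mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the budget `eps`, the step size `step`, the iteration count `n_steps`, and the quadratic `toy_loss` are all assumptions chosen for illustration. In the real attack, the gradient would come from the victim IgSEG model's loss rather than from a toy function.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=8 / 255, step=2 / 255, n_steps=10):
    """L-infinity PGD: take signed gradient ascent steps, then project
    back into the eps-ball around the clean image x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        g = grad_fn(x_adv)                        # gradient of the loss w.r.t. the image
        x_adv = x_adv + step * np.sign(g)         # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay in the valid pixel range
    return x_adv

# Hypothetical stand-in for the victim model's loss:
# L(x) = 0.5 * ||x - target||^2, whose gradient is (x - target).
rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.8, size=(3, 8, 8))
target = np.zeros_like(image)

def toy_loss(x):
    return 0.5 * np.sum((x - target) ** 2)

def toy_grad(x):
    return x - target

adv_image = pgd_attack(image, toy_grad)
print("loss before:", toy_loss(image), "loss after:", toy_loss(adv_image))
```

The projection step is what keeps the perturbation bounded: however many ascent steps are taken, no pixel moves more than `eps` away from its original value.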

Abstract

Multimodal learning involves developing models that can integrate information from various sources, such as images and text. Within this field, multimodal text generation is a crucial task that processes data from multiple modalities and outputs text. Image-guided story ending generation (IgSEG) is a particularly significant task that requires understanding the complex relationships between text and image data in order to produce a complete story ending. Unfortunately, deep neural networks, which are the backbone of recent IgSEG models, are vulnerable to adversarial samples. Current adversarial attack methods mainly focus on single-modality data and do not analyze adversarial attacks on multimodal text generation tasks that use cross-modal information. To this end, we propose an iterative adversarial attack method (Iterative-attack) that fuses image-modality and text-modality attacks, allowing the search for adversarial text and images to proceed in a more effective, iterative way. Experimental results demonstrate that the proposed method outperforms existing single-modal and non-iterative multimodal attack methods, indicating its potential for improving the adversarial robustness of multimodal text generation models, such as those for multimodal machine translation and multimodal question answering.
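One common way to decide which words a text attack should perturb first is leave-one-out importance ranking: mask each word in turn and measure how much the loss changes. The sketch below illustrates that idea only; the paper's exact scoring rule is not given in this summary, so the masking scheme, the `[MASK]` token, and the keyword-counting `toy_loss` are assumptions, and in a multimodal setting the loss would also take the image as input.

```python
from typing import Callable, List, Tuple

def rank_words_by_importance(
    words: List[str],
    loss_fn: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[Tuple[str, float]]:
    """Score each word by the loss increase observed when it is masked,
    then return (word, score) pairs sorted from most to least important."""
    base = loss_fn(words)
    scores = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        scores.append((word, loss_fn(masked) - base))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-in for a model loss: the number of keywords
# that are missing from the input context.
KEYWORDS = {"dog", "park"}

def toy_loss(words: List[str]) -> float:
    return float(len(KEYWORDS - set(words)))

context = "the dog ran to the park".split()
ranking = rank_words_by_importance(context, toy_loss)
print(ranking)
```

An attack would then try substitutions for the top-ranked words first, since perturbing them is expected to shift the model's output the most.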

Paper Structure

This paper contains 26 sections, 5 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: An example of the bottleneck of a single-modality adversarial attack against a multimodal text generation model. The blue arrows denote the information chain, and the heat map shows where the target model focuses on the image. When a single-modality adversarial example attacks the target model, the unperturbed modality may provide complementary information, causing the attack to fail.
  • Figure 2: Illustration of Iterative-attack. We fuse the image-modality attack into the text-modality attack to iteratively find the most vulnerable multimodal information patch, which avoids the dilemma in which the information shift caused by a single-modal adversarial attack is corrected by the other modality’s information.
  • Figure 3: The Grad-CAM visualizations of (a) the original example $(x_t, x_i)$ and (b) the adversarial example $(x'_t, x'_i)$ derived by Iterative-attack against MMT on the VIST-E dataset, where the adversarial perturbation is obtained by $x'_i - x_i$ (pixel values of the perturbation are amplified ×20 for visualization).
  • Figure 4: The Grad-CAM visualizations of (c) the original example $(x_t, x_i)$ and (d) the adversarial example $(x'_t, x'_i)$ derived by Iterative-attack against MMT on the VIST-E dataset, where the adversarial perturbation is obtained by $x'_i - x_i$ (pixel values of the perturbation are amplified ×20 for visualization).
  • Figure 5: The Grad-CAM visualizations of (e) the original example $(x_t, x_i)$ and (f) the adversarial example $(x'_t, x'_i)$ derived by Iterative-attack against MMT on the VIST-E dataset, where the adversarial perturbation is obtained by $x'_i - x_i$ (pixel values of the perturbation are amplified ×20 for visualization).
  • ...and 1 more figures
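
The captions for Figures 3 to 5 describe the perturbation as $x'_i - x_i$ with pixel values amplified ×20 for display. A minimal sketch of such a visualization follows. Only the difference and the ×20 factor come from the captions; centering the result on mid-gray (0.5) and clipping to [0, 1] are assumptions made here so that both positive and negative changes remain visible.

```python
import numpy as np

def amplify_perturbation(x_orig, x_adv, factor=20.0):
    """Return a displayable image of the perturbation x_adv - x_orig.

    The difference is scaled by `factor` and shifted to mid-gray so that
    zero change maps to 0.5 (an assumed display convention)."""
    delta = x_adv - x_orig
    return np.clip(0.5 + factor * delta, 0.0, 1.0)

# Toy example: a uniform +0.01 shift on a small image with values in [0, 1].
clean = np.full((3, 4, 4), 0.3)
perturbed = clean + 0.01
vis = amplify_perturbation(clean, perturbed)
```

Without amplification, a perturbation bounded by a few intensity levels would be nearly invisible to the eye, which is why such figures scale it before plotting.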