Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning

Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Cho-Jui Hsieh

TL;DR

This work introduces Show-and-Fool, a novel optimization-based framework to craft imperceptible adversarial perturbations that mislead neural image captioning systems into producing targeted captions or captions containing specified keywords. It formulates a constrained objective combining image distortion with CNN+RNN-specific attack losses, and solves it via an arctanh transformation and binary-search tuning of a regularization constant. Across MSCOCO, Show-and-Fool achieves high target-attack success rates (up to ~99%) with small perturbations and demonstrates strong cross-model transferability, revealing vulnerabilities in visual language grounding and the need for robust defenses. The findings underscore that captioning systems pose unique challenges beyond image classifiers and suggest directions for improving robustness, including adversarial training and architecture-aware defenses.
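The two optimization ingredients named above, the arctanh change of variables and the binary search over the regularization constant, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: a toy linear scorer `W` stands in for the CNN+RNN captioner, a hinge loss on its scores stands in for the caption and keyword losses, and the names `attack_once`, `binary_search_c`, `margin`, `lr` and `steps` are hypothetical.

```python
import numpy as np

def attack_once(x, W, target, c, steps=500, lr=0.05, margin=0.1):
    """Gradient descent on c * hinge(x_adv) + ||x_adv - x||^2 with x_adv = tanh(w).

    Returns (best_x_adv, best_dist); best_x_adv is None if no iterate succeeded.
    """
    # arctanh maps the image to an unconstrained variable w; clipping avoids +-inf.
    w = np.arctanh(np.clip(x, -1 + 1e-6, 1 - 1e-6))
    best_x, best_dist = None, np.inf
    for _ in range(steps):
        x_adv = np.tanh(w)                 # always stays inside [-1, 1]
        z = W @ x_adv                      # toy scores (assumed stand-in for caption logits)
        dist = np.linalg.norm(x_adv - x)
        # Keep the least-distorted iterate that reaches the target.
        if int(np.argmax(z)) == target and dist < best_dist:
            best_x, best_dist = x_adv.copy(), dist
        # Strongest competing class for the hinge term.
        others = np.delete(np.arange(len(z)), target)
        rival = others[np.argmax(z[others])]
        grad_x = 2.0 * (x_adv - x)         # gradient of the squared L2 distortion
        if z[rival] - z[target] + margin > 0:
            grad_x = grad_x + c * (W[rival] - W[target])  # hinge is active
        w = w - lr * grad_x * (1.0 - x_adv ** 2)          # chain rule through tanh
    return best_x, best_dist

def binary_search_c(x, W, target, c_lo=0.0, c_hi=100.0, rounds=8):
    """Bisect on c: shrink it after a success, grow it after a failure."""
    best = None
    for _ in range(rounds):
        c = 0.5 * (c_lo + c_hi)
        x_adv, dist = attack_once(x, W, target, c)
        if x_adv is not None:
            if best is None or dist < best[1]:
                best = (x_adv, dist, c)
            c_hi = c   # success: try a smaller weight on the attack loss
        else:
            c_lo = c   # failure: the attack loss needs more weight
    return best

# Toy usage: push a 2-pixel "image" from class 0 to class 1.
x = np.array([0.5, -0.5])
W = np.eye(2)
result = binary_search_c(x, W, target=1)
```

Because `x_adv` is always `tanh(w)`, the pixel-range constraint holds without any projection step, and the bisection trades attack strength against distortion by keeping the least-distorted successful example it has seen.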

Abstract

Visual language grounding is widely studied in modern neural image captioning systems, which typically adopt an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for language caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be misled to output some randomly chosen captions or keywords. Our extensive experiments show that our algorithm can successfully craft visually similar adversarial examples with randomly targeted captions or keywords, and the adversarial examples can be made highly transferable to other image captioning systems. Consequently, our approach leads to new robustness implications of neural image captioning and novel insights in visual language grounding.

Paper Structure

This paper contains 19 sections, 24 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Adversarial examples crafted by Show-and-Fool using the targeted caption method. The target captioning model is Show-and-Tell (Vinyals et al., 2015), the original images are selected from the MSCOCO validation set, and the targeted captions are randomly selected from the top-1 inferred captions of other validation images.
  • Figure 2: An adversarial example ($\|\delta\|_2 = 1.284$) of a cake image crafted by the Show-and-Fool targeted keyword method with three keywords: "dog", "cat" and "frisbee".
  • Figure 3: A highly transferable adversarial example ($\|\delta\|_2 = 15.226$), crafted on Show-and-Tell with the targeted caption method, that transfers to Show-Attend-and-Tell, yielding similar adversarial captions.
  • Figure 4: An adversarial example ($\|\delta\|_2 = 2.977$) of an elephant image crafted by the Show-and-Fool targeted caption method with the target caption "A black and white photo of a group of people".
  • Figure 5: An adversarial example ($\|\delta\|_2 = 2.979$) of a clock image crafted by the Show-and-Fool targeted keyword method with three keywords: "meat", "white" and "topped".
  • ...and 8 more figures