Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning
Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Cho-Jui Hsieh
TL;DR
This work introduces Show-and-Fool, a novel optimization-based framework for crafting imperceptible adversarial perturbations that mislead neural image captioning systems into producing targeted captions or captions containing specified keywords. It formulates a constrained objective that combines image distortion with CNN+RNN-specific attack losses, and solves it via an arctanh transformation and binary-search tuning of a regularization constant. On MSCOCO, Show-and-Fool achieves high targeted-attack success rates (up to ~99%) with small perturbations and demonstrates strong cross-model transferability, revealing vulnerabilities in visual language grounding and the need for robust defenses. The findings underscore that captioning systems pose unique challenges beyond those of image classifiers, and suggest directions for improving robustness, including adversarial training and architecture-aware defenses.
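The two solver ingredients named in the summary, the arctanh change of variables and the binary search over the regularization constant, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the [-1, 1] pixel range, the tenfold growth of the constant before an upper bound is found, and the `run_attack` callable are all assumptions made for illustration. The sketch also assumes that a larger constant weights the attack loss more heavily relative to the distortion term, so the search lowers the constant after a success and raises it after a failure.

```python
import math

def to_tanh_space(pixels, eps=1e-6):
    # Map pixel values in [-1, 1] (assumed range) to unconstrained
    # variables via arctanh; clip first so arctanh stays finite.
    return [math.atanh(min(max(p, -1 + eps), 1 - eps)) for p in pixels]

def from_tanh_space(ws):
    # tanh maps any real value back into [-1, 1], so the box
    # constraint on the adversarial image holds automatically.
    return [math.tanh(w) for w in ws]

def binary_search_c(run_attack, c_init=1.0, steps=8):
    # run_attack is a hypothetical callable: it optimizes the attack
    # objective for a given constant c and returns (success, distortion).
    lo, hi = 0.0, None
    c = c_init
    best = None  # (c, distortion) of the least-distorted success so far
    for _ in range(steps):
        success, distortion = run_attack(c)
        if success:
            if best is None or distortion < best[1]:
                best = (c, distortion)
            hi = c                      # try a smaller constant next
            c = (lo + hi) / 2
        else:
            lo = c                      # need more weight on the attack loss
            c = c * 10 if hi is None else (lo + hi) / 2
    return best

# Toy stand-in for an attack: succeeds once c reaches 3.0, with
# distortion growing with c. Purely illustrative.
def toy_attack(c):
    return (c >= 3.0, c)

result = binary_search_c(toy_attack)
```

With the toy stand-in, the search narrows toward the smallest constant that still succeeds. In a real attack, `run_attack` would run gradient-based optimization over the tanh-space variables for each candidate constant.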
Abstract
Visual language grounding is widely studied in modern neural image captioning systems, which typically adopt an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be misled into outputting randomly chosen captions or keywords. Our extensive experiments show that our algorithm can successfully craft visually similar adversarial examples with randomly targeted captions or keywords, and that the adversarial examples can be made highly transferable to other image captioning systems. Consequently, our approach leads to new robustness implications for neural image captioning and novel insights into visual language grounding.
