VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models

Ziyi Yin, Muchao Ye, Tianrong Zhang, Jiaqi Wang, Han Liu, Jinghui Chen, Ting Wang, Fenglong Ma

TL;DR

VQAttack targets the robustness of Visual Question Answering under the prevalent pre-training & fine-tuning regime by generating transferable adversarial image-text pairs from a fixed pre-trained multimodal source model. The framework combines two novel modules: an LLM-enhanced image attack that leverages latent-feature disruption and masked-answer anti-recovery, and a cross-modal joint attack that updates image and text perturbations in a staged manner using gradient-informed word substitutions. Empirical results on VQAv2 and TextVQA across five VL models show that VQAttack outperforms state-of-the-art baselines in transferable attacks, revealing a significant security blind spot in current VQA pipelines. The work also provides extensive ablations, qualitative analyses, and insights into how shared information across pre-trained and downstream models amplifies vulnerability, with source code slated for release.

Abstract

Visual Question Answering (VQA) is a fundamental task in computer vision and natural language processing. Although the "pre-training & fine-tuning" learning paradigm significantly improves VQA performance, the adversarial robustness of this paradigm has not been explored. In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack target VQA models. Correspondingly, we propose a novel VQAttack model, which iteratively generates both image and text perturbations with two designed modules: the large language model (LLM)-enhanced image attack module and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes a latent representation-based loss to generate feature-level image perturbations. It then incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module is triggered at specific iterations and updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and synonym-based word substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack in the transferable attack setting, compared with state-of-the-art baselines. This work reveals a significant blind spot in the "pre-training & fine-tuning" paradigm on VQA tasks. Source code will be released.
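
The iteration schedule described in the abstract (an image update at every step, with a text update triggered only at certain steps) can be illustrated with a small toy sketch. This is not the authors' implementation: the linear `encode` function stands in for a pre-trained encoder, the loss is a plain squared latent distance, the masked answer anti-recovery loss is omitted, and the text step is a placeholder synonym swap rather than a gradient-informed one. All names and parameters (`attack`, `eps`, `step`, `text_every`, `synonyms`) are hypothetical.

```python
import random

def encode(W, x):
    """Toy linear 'encoder' (assumption): latent = W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def latent_loss_grad(W, delta):
    """Gradient of ||W(x + delta) - W x||^2 = ||W delta||^2 w.r.t. delta."""
    Wd = encode(W, delta)
    return [2.0 * sum(W[i][j] * Wd[i] for i in range(len(W)))
            for j in range(len(delta))]

def sign(v):
    return (v > 0) - (v < 0)

def attack(W, x, words, synonyms, eps=0.1, step=0.02,
           iters=10, text_every=5, seed=0):
    """Toy staged attack: image step every iteration, text step every
    `text_every` iterations. Returns (delta, words, number of text edits)."""
    rng = random.Random(seed)
    # Small random start: the latent-distance gradient is zero at delta = 0.
    delta = [0.1 * rng.uniform(-eps, eps) for _ in x]
    words = list(words)
    text_updates = 0
    for t in range(1, iters + 1):
        # Image step: sign-gradient ascent on the latent distance,
        # then clip each coordinate back into the L-infinity budget.
        g = latent_loss_grad(W, delta)
        delta = [max(-eps, min(eps, d + step * sign(gi)))
                 for d, gi in zip(delta, g)]
        # Joint step (placeholder): swap the first word that has a synonym.
        if t % text_every == 0:
            for i, w in enumerate(words):
                if w in synonyms:
                    words[i] = synonyms[w]
                    text_updates += 1
                    break
    return delta, words, text_updates

if __name__ == "__main__":
    W = [[1.0, 0.5], [-0.3, 2.0]]
    delta, words, n = attack(W, [0.2, 0.7],
                             ["what", "color", "is", "the", "car"],
                             {"color": "hue", "car": "automobile"})
    print(delta, words, n)
```

The sketch keeps the perturbation inside a fixed budget by clipping after every step, and it separates the frequent image update from the occasional text update so that the two stages are easy to tell apart.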

Paper Structure

This paper contains 25 sections, 7 equations, 6 figures, 2 tables, and 1 algorithm.

Figures (6)

  • Figure 1: An example of transferable adversarial attacks on VQA via pre-trained models.
  • Figure 2: Overview of the proposed VQAttack.
  • Figure 3: Ablation study results on the source model TCL and the victim model VLMO-B.
  • Figure 4: Qualitative results of VQAttack on the VQAv2 dataset generated by the TCL model. The original answer and perturbed words are displayed in blue and red, respectively. The wrong prediction is shown with an underline.
  • Figure 5: An example of transferable adversarial attacks on VQA via pre-trained models.
  • ...and 1 more figure