PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

Jingning Xu, Haochen Luo, Chen Liu

Abstract

Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing works based on adversarial training against task-specific adversarial examples are computationally expensive and often fail to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, thus requiring no modification to the underlying models. To balance robustness and efficiency, we instantiate PDA as lightweight variants that reduce the inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong, and practical defense framework for VLMs during inference.
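The test-time pipeline the abstract describes can be sketched in a few lines. In this illustrative sketch, `paraphrase`, `decompose`, and `vlm_answer` are hypothetical stand-ins for the Paraphrase Agent, Question Decomposer, and base VLM, and a simple majority vote stands in for the confidence- and consistency-aware fusion performed by the Answer Aggregation Agent; none of these names come from the paper's code.

```python
from collections import Counter

def pda_defense(image, query, vlm_answer, paraphrase, decompose, k=3):
    """Minimal PDA sketch: answer K paraphrased, decomposed views of the
    query and aggregate by majority vote (a stand-in for the paper's
    consistency-aware fusion)."""
    view_answers = []
    for prompt in paraphrase(query, k):           # Paraphrase Agent (stub)
        subs = decompose(prompt)                  # Question Decomposer (stub)
        sub_answers = [vlm_answer(image, s) for s in subs]
        # per-view fusion: majority vote over sub-question answers
        view_answers.append(Counter(sub_answers).most_common(1)[0][0])
    # cross-view aggregation over the K paraphrased views
    return Counter(view_answers).most_common(1)[0][0]

# Toy stand-ins, purely illustrative: the "VLM" always answers "t shirt".
paraphrase = lambda q, k: [f"{q} (view {i})" for i in range(k)]
decompose = lambda q: [q + " color?", q + " shape?"]
vlm_answer = lambda img, s: "t shirt"

print(pda_defense(None, "t shirt or jeans?", vlm_answer, paraphrase, decompose))
# prints "t shirt"
```

Because all three stages operate on text and only query the model for answers, the sketch needs nothing beyond black-box access to `vlm_answer`, which is the property the paper's defense phase relies on.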

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the threat model and PDA. (a) Attack phase: an adversary adds pixel-level perturbations guided by VLM gradients, causing the model to flip its answer under the original question. (b) Defense phase: our training-free, text-side Paraphrase–Decomposition–Aggregation pipeline, which requires only black-box access and suppresses adversarially biased views by cross-checking stable visual evidence.
  • Figure 2: Pipeline of the proposed PDA defense and its variants. Given an image $x$ and a query $t$, the Paraphrase Agent produces semantically equivalent prompts. Each prompt is factorized by the Question Decomposer into atomic questions that are individually answered by the base VLM. The Answer Aggregation Agent performs confidence- and consistency-aware fusion to yield the final prediction. The bottom panels illustrate four PDA variants: (a) PDA-Full, (b) PDA-RJV, (c) PDA-RDA, and (d) PDA-PV.
  • Figure 3: Qualitative effect of PDA. (a) The base VLM is attacked into choosing "jeans" in a two-way "t shirt vs. jeans" question, whereas PDA paraphrases the query into targeted checks and recovers the correct "t shirt" label. (b) For a cluttered scene with a black-and-white cat in a green bowl, PDA decomposes the caption prompt into factual sub-questions and aggregates the answers into a faithful caption.
  • Figure 4: Overall performance of PDA variants on VQA-v2 and ImageNet-D under different paraphrase counts ($K{=}3$ and $K{=}5$).
  • Figure 5: Additional VQA-v2 qualitative examples. For each adversarial image we compare the answer of the undefended VLM with the output of PDA and visualize key paraphrases and sub-questions that support the corrected decision.
  • ...and 2 more figures