AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization

Chaohu Liu, Tianyi Gui, Yu Liu, Linli Xu

TL;DR

This paper addresses the vulnerability of large vision-language models (LVLMs) to adversarial inputs and proposes AdPO, a defense that reframes adversarial training as preference optimization using a DPO-style objective. By updating only the image encoder and employing online, unsupervised preferred-image signals (preferred vs non-preferred interpretations) together with an adversarial-image optimization term, AdPO achieves superior clean accuracy and adversarial robustness. Training on a smaller LVLM (TinyLLaVA) and transferring to larger LVLMs yields competitive results and efficiency akin to prior methods, with strong performance on downstream tasks like image captioning and VQA under untargeted and targeted attacks. The work introduces a novel multimodal defense perspective, showing that preference optimization can guide robust representations without extensive annotation or full-model fine-tuning, and it highlights avenues for refining DPO-based defenses in multimodal settings.

Abstract

Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently witnessed remarkable advancements and are increasingly being deployed in real-world applications. However, inheriting the sensitivity of visual neural networks, LVLMs remain vulnerable to adversarial attacks, which can result in erroneous or malicious outputs. While existing efforts utilize adversarial fine-tuning to enhance robustness, they often suffer from performance degradation on clean inputs. In this paper, we propose AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem, aiming to enhance the model's preference for generating normal outputs on clean inputs while rejecting potentially misleading outputs for adversarial examples. Notably, AdPO achieves this by modifying only the image encoder, e.g., CLIP ViT, resulting in superior clean and adversarial performance in a variety of downstream tasks. Because training involves large language models (LLMs), the computational cost increases significantly. We validate that training on smaller LVLMs and subsequently transferring to larger models can achieve competitive performance while maintaining efficiency comparable to baseline methods. Our comprehensive experiments confirm the effectiveness of the proposed AdPO, which provides a novel perspective for future adversarial defense research.
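
The abstract frames adversarial training as preference optimization. For orientation, the generic DPO loss for a single preference pair can be sketched as below. This is the standard DPO formulation, not the paper's exact objective, which this summary does not spell out. The function and argument names are hypothetical, and mapping "preferred" to the output for a clean image and "rejected" to the output for an adversarial image is an assumption based on the abstract's description.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one preference pair.

    Inputs are sequence log-probabilities (floats) under the trainable
    policy and under a frozen reference model. All names are hypothetical.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does, scaled by beta.
    reward_preferred = beta * (policy_logp_preferred - ref_logp_preferred)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy raises the preferred response's log-probability relative to the rejected one. In a setting where only the image encoder is updated, as the abstract describes, only the encoder's parameters would receive gradients from such a loss.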

Paper Structure

This paper contains 13 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of untargeted adversarial attacks on LLaVA using different CLIP models as encoders. The original model can produce accurate outputs on clean images, but it makes significant errors when faced with adversarial attacks. Although the adversarially trained versions, TeCoA and FARE, have better adversarial robustness, they still tend to hallucinate or fail to fully comprehend the image. Comparatively, our AdPO exhibits strong performance on both clean and adversarial images.
  • Figure 2: The architecture of our proposed AdPO. AdPO mainly consists of two parts: (left) preferred image optimization and (right) adversarial image optimization. Preferred image optimization incorporates both clean and adversarial images into adversarial training while maintaining the model’s performance on clean inputs, and adversarial image optimization can significantly enhance the model’s adversarial robustness.
  • Figure 3: Qualitative assessment of targeted attacks on LLaVA. (Left) When encountering clean images, TeCoA may exhibit noticeable errors, which is undesirable in adversarial defense, while FARE and AdPO demonstrate better clean performance. (Right) When faced with adversarial images, the original CLIP version of LLaVA is easily compromised, FARE shows some adversarial robustness but loses more details or makes subtle errors, whereas AdPO performs better.
  • Figure 4: Ablation study of preferred image optimization and adversarial image optimization.