Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness

Sibo Wang, Jie Zhang, Zheng Yuan, Shiguang Shan

TL;DR

The paper tackles zero-shot adversarial robustness in large vision-language models such as CLIP, where standard adversarial fine-tuning risks overfitting and a loss of generalization. It introduces Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT), a two-branch objective that uses text embeddings from the frozen pre-trained text encoder to guide adversarial example generation and enforces both robustness and generalization via $L_{robust}$ and $L_{general}$ (with KL divergence terms), plus an $L_{clean}$ regularizer. The final loss is $L = L_{robust} + \alpha L_{general} + \beta L_{clean}$; only the image encoder is updated, and adversarial example generation alternates with parameter updates. Experiments across 15 zero-shot datasets show average gains of $4.99\%$ in top-1 robust accuracy over the state-of-the-art method and $8.72\%$ in clean accuracy, indicating improved zero-shot robustness without sacrificing generalization. Code is released for replication.

Abstract

Large-scale pre-trained vision-language models like CLIP have demonstrated impressive performance across various tasks and exhibit remarkable zero-shot generalization capability, yet they are also vulnerable to imperceptible adversarial examples. Existing works typically employ adversarial training (fine-tuning) as a defense against adversarial examples. However, applying it directly to the CLIP model may result in overfitting, compromising the model's capacity for generalization. In this paper, we propose the Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT) method, which leverages supervision from the original pre-trained model through a carefully designed auxiliary branch to enhance the model's zero-shot adversarial robustness. Specifically, PMG-AFT minimizes the distance between the features of adversarial examples in the target model and those in the pre-trained model, aiming to preserve the generalization features already captured by the pre-trained model. Extensive experiments on 15 zero-shot datasets demonstrate that PMG-AFT significantly outperforms the state-of-the-art method, improving the top-1 robust accuracy by an average of 4.99%. Furthermore, our approach consistently improves clean accuracy by an average of 8.72%. Our code is available at https://github.com/serendipity1122/Pre-trained-Model-Guided-Fine-Tuning-for-Zero-Shot-Adversarial-Robustness.

Paper Structure

This paper contains 8 sections, 10 equations, 3 figures.

Figures (3)

  • Figure 1: Zero-shot robust accuracy (a) and clean accuracy (b) of the original CLIP and of CLIP models fine-tuned on TinyImageNet [deng2009imagenet] using various methods, across multiple datasets. FT-Standard: CLIP fine-tuned on clean samples. FT-TeCoA: CLIP fine-tuned using TeCoA [mao2022understanding]. PMG-AFT (ours): CLIP fine-tuned by our method.
  • Figure 2: Relative $L_2$ distance in parameter space between the original CLIP model and CLIP models fine-tuned on TinyImageNet using various strategies. Our FT-Standard denotes our proposed fine-tuning method applied to clean target datasets.
  • Figure 3: The pipeline of PMG-AFT. PMG-AFT first uses the text encoder of the pre-trained CLIP model to obtain text embeddings, then employs the TeCoA [mao2022understanding] loss, which is $L_{robust}$ in our method, to generate adversarial examples. The parameter update phase splits into two branches: the robustness information branch, which maximizes the similarity between the output of the target model and the ground truth (GT) via $L_{robust}$, and the generalization information branch, which aligns the outputs of the target model and the original model on adversarial samples via $L_{general}$. A regularization loss ($L_{clean}$) is applied between the adversarial and clean outputs. Only the image encoder of the target model is trainable, and adversarial example generation alternates with parameter updates. $\odot$ denotes the matrix inner product.
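
The way the three loss terms named in the Figure 3 caption combine into the weighted sum given in the TL;DR can be illustrated with a small, framework-free sketch for a single sample. This is not the authors' implementation. It assumes that each model output is a vector of image-text similarity scores (one per class prompt), that $L_{robust}$ is a cross-entropy against the ground-truth label, and that $L_{general}$ and $L_{clean}$ are KL divergences taken from the target model's adversarial prediction to the reference prediction. The KL direction, the function names, and the default values of `alpha` and `beta` are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(probs[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pmg_aft_loss(logits_adv_target, logits_adv_pretrained,
                 logits_clean_target, label, alpha=1.0, beta=1.0):
    """Combine the three terms as L = L_robust + alpha*L_general + beta*L_clean.

    logits_adv_target:     target model scores on the adversarial image
    logits_adv_pretrained: frozen pre-trained model scores on the same image
    logits_clean_target:   target model scores on the clean image
    alpha, beta:           placeholder weights, not values from the paper
    """
    p_adv = softmax(logits_adv_target)
    p_ori = softmax(logits_adv_pretrained)
    p_clean = softmax(logits_clean_target)

    l_robust = cross_entropy(p_adv, label)    # robustness branch
    l_general = kl_divergence(p_adv, p_ori)   # generalization branch (assumed KL direction)
    l_clean = kl_divergence(p_adv, p_clean)   # clean regularizer (assumed KL direction)

    total = l_robust + alpha * l_general + beta * l_clean
    return total, (l_robust, l_general, l_clean)


if __name__ == "__main__":
    # Toy scores for a 3-class problem; the numbers are made up.
    total, parts = pmg_aft_loss([0.1, 1.5, 0.2], [2.0, 0.5, -1.0],
                                [1.8, 0.4, -0.7], label=0, alpha=0.5, beta=2.0)
    print("total:", total, "parts:", parts)
```

In a real training loop these quantities would be computed on batches with an autodiff framework, with gradients flowing only into the target image encoder, as the caption states. When the target model's adversarial prediction matches both reference predictions, the two KL terms vanish and the objective reduces to $L_{robust}$ alone.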