
Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples

Ziqi Zhou, Minghui Li, Wei Liu, Shengshan Hu, Yechao Zhang, Wei Wan, Lulu Xue, Leo Yu Zhang, Dezhong Yao, Hai Jin

TL;DR

Gen-AF is a two-stage adversarial fine-tuning approach that improves the robustness of downstream models against state-of-the-art downstream-agnostic adversarial examples (DAEs), and it performs evolutionary adaptability fine-tuning to improve the model's generalizability.

Abstract

With the evolution of self-supervised learning, the pre-training paradigm has become a predominant approach in deep learning. Model providers release pre-trained encoders that serve as general-purpose feature extractors, so downstream users can benefit from large models with minimal effort through fine-tuning. However, recent work has exposed a vulnerability in pre-trained encoders: they are susceptible to downstream-agnostic adversarial examples (DAEs) crafted by attackers. An open question is whether downstream models can be made robust against DAEs, particularly when the pre-trained encoders are publicly accessible to attackers. In this paper, we first examine existing defenses against adversarial examples within the pre-training paradigm. We find that current defenses fail because of the domain shift between pre-training data and downstream tasks, and because of the sensitivity of encoder parameters. To address these challenges, we propose Genetic Evolution-Nurtured Adversarial Fine-tuning (Gen-AF), a two-stage adversarial fine-tuning approach for enhancing the robustness of downstream models. Extensive experiments across ten self-supervised training methods and six datasets show that Gen-AF achieves both high testing accuracy and high robust testing accuracy against state-of-the-art DAEs.
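The DAE threat described in the abstract can be illustrated with a minimal, hypothetical sketch. Here an attacker with white-box access to a public encoder searches, within an L_inf budget, for a perturbation that pushes the encoder's features as far as possible from the clean features. The search never consults a downstream head or label. The toy encoder `f(x) = tanh(Wx)`, the squared feature-distance objective, and every name below (`encode`, `feature_loss`, `feature_pgd`, and so on) are assumptions for illustration. This is a generic feature-space PGD attack. It is not the AdvEncoder or PAP objective, and it is not part of Gen-AF.

```python
import math
import random


def encode(W, x):
    """Hypothetical toy encoder f(x) = tanh(W x); W is a list of rows."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W]


def feature_loss(W, x, delta):
    """Squared L2 distance between the features of x + delta and of x."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    return sum((a - c) ** 2 for a, c in zip(encode(W, x_adv), encode(W, x)))


def feature_loss_grad(W, x, delta):
    """Analytic gradient of feature_loss with respect to delta."""
    f_adv = encode(W, [xi + di for xi, di in zip(x, delta)])
    f_clean = encode(W, x)
    # Chain rule through tanh: dL/dz_i = 2 (f_adv_i - f_clean_i) (1 - f_adv_i^2).
    g_z = [2 * (a - c) * (1 - a * a) for a, c in zip(f_adv, f_clean)]
    # Back through z = W (x + delta): dL/d delta_j = sum_i g_z_i * W_ij.
    return [sum(g_z[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]


def project(x, delta, eps):
    """Clip delta to the L_inf ball of radius eps and keep x + delta in [0, 1]."""
    out = []
    for xi, di in zip(x, delta):
        di = max(-eps, min(eps, di))
        di = max(-xi, min(1.0 - xi, di))
        out.append(di)
    return out


def feature_pgd(W, x, eps=0.03, alpha=0.01, steps=20, seed=0):
    """Sign-gradient ascent on feature_loss; returns the best perturbation found."""
    rng = random.Random(seed)
    # Random start: the gradient is exactly zero at delta = 0.
    delta = project(x, [rng.uniform(-eps, eps) for _ in x], eps)
    best_delta, best_loss = delta, feature_loss(W, x, delta)
    for _ in range(steps):
        grad = feature_loss_grad(W, x, delta)
        step = [alpha * ((g > 0) - (g < 0)) for g in grad]
        delta = project(x, [d + s for d, s in zip(delta, step)], eps)
        loss = feature_loss(W, x, delta)
        if loss > best_loss:
            best_delta, best_loss = delta, loss
    return best_delta, best_loss


# Usage on a random 3x4 toy encoder and an input in [0, 1]^4.
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
x = [rng.random() for _ in range(4)]
delta, loss = feature_pgd(W, x, eps=0.05)
```

The function returns the best iterate rather than the last one, so the returned loss is never lower than the loss at the random start. A realistic attack would use an autodiff framework and a deep encoder instead of a hand-derived gradient.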

Paper Structure (28 sections, 12 equations, 8 figures, 9 tables, 1 algorithm)

Figures (8)

  • Figure 1: Overview of downstream users mitigating DAEs.
  • Figure 2: Experimental results of various defense methods against downstream-agnostic adversarial examples. ASR-A and RA-A denote the attack success rate and robust accuracy for adversarial examples created with AdvEncoder; ASR-P and RA-P denote the same metrics for adversarial examples created with PAP. Panels (a) - (e) show image preprocessing (IP), (f) - (g) parameter pruning (PR), (h) - (i) model distillation (Dist), and (j) - (k) adversarial training (AT).
  • Figure 3: The pipeline of our defense.
  • Figure 4: The RA (%) of downstream models adversarially trained with Gen-AF under transfer-based black-box attack settings. CIFAR10-ImageNet means that CIFAR10 and ImageNet are used to train two encoders, from which the adversarial examples and the downstream tasks are built, respectively; the other labels follow the same convention. Panels (a) - (b) show results for CIFAR10 pre-training and (c) - (d) for ImageNet pre-training.
  • Figure 5: Ablation study under different settings (%).
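The ASR and RA metrics in the Figure 2 caption can be computed from predicted labels, as in the sketch below. The exact definitions used here are assumptions, and the paper may normalize differently. RA is taken as the fraction of adversarial examples that the downstream model still assigns the true label. ASR is taken as the fraction of clean-correct samples that the attack turns into misclassifications. The function names are hypothetical.

```python
def robust_accuracy(labels, adv_preds):
    """Fraction of adversarial examples still classified with the true label."""
    correct = sum(p == y for p, y in zip(adv_preds, labels))
    return correct / len(labels)


def attack_success_rate(labels, clean_preds, adv_preds):
    """Among samples classified correctly when clean, fraction misclassified after the attack."""
    clean_correct = [i for i, (p, y) in enumerate(zip(clean_preds, labels)) if p == y]
    if not clean_correct:
        return 0.0
    flipped = sum(adv_preds[i] != labels[i] for i in clean_correct)
    return flipped / len(clean_correct)


# Toy predictions for four samples (true labels 0..3).
labels = [0, 1, 2, 3]
clean_preds = [0, 1, 2, 0]
adv_preds = [0, 2, 1, 0]
ra = robust_accuracy(labels, adv_preds)  # 1 of 4 adversarial examples still correct
asr = attack_success_rate(labels, clean_preds, adv_preds)  # 2 of 3 clean-correct samples flipped
```

Under this convention, ASR and RA use different denominators, so they need not sum to 100%.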