
Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm

Jaehan Kim, Minkyoo Song, Seung Ho Na, Seungwon Shin

TL;DR

This paper introduces Obliviate, a PEFT-integrable backdoor defense that significantly reduces the attack success rate of state-of-the-art task-agnostic backdoors and remains robust against both task-specific backdoors and adaptive attacks.

Abstract

Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, no practical defense effectively counters task-agnostic backdoors in the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of state-of-the-art task-agnostic backdoors (83.6%$\downarrow$). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. The source code will be available at https://github.com/obliviateARR/Obliviate.
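
The abstract reports its main result as a reduction in attack success rate (ASR). As a rough illustration of that metric, the sketch below assumes a common definition for classification tasks: the fraction of trigger-carrying inputs, whose true label is not the attacker's target, that the model nonetheless assigns to the target label. The function name, the toy labels, and the predictions are all hypothetical, and the abstract does not state whether the paper computes ASR or the 83.6% reduction in exactly this way.

```python
from typing import Sequence


def attack_success_rate(
    predictions: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """ASR under an assumed definition: share of triggered inputs with a
    non-target true label that are predicted as the target label."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have equal length")
    # Inputs that already belong to the target class cannot show a flip,
    # so they are excluded from the denominator.
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no inputs with a non-target true label")
    hits = sum(1 for p in eligible if p == target_label)
    return hits / len(eligible)


# Hypothetical toy data: predictions on triggered inputs, target label 1.
true_labels = [0, 0, 0, 0, 1]
undefended_predictions = [1, 1, 1, 0, 1]
defended_predictions = [0, 0, 1, 0, 1]

asr_before = attack_success_rate(undefended_predictions, true_labels, target_label=1)
asr_after = attack_success_rate(defended_predictions, true_labels, target_label=1)
asr_drop = asr_before - asr_after  # absolute reduction in ASR
```

With these made-up predictions, three of the four eligible inputs flip to the target label before the defense and one of four afterwards, so the absolute drop is the difference between those two rates.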

Paper Structure

This paper contains 34 sections, 3 equations, 9 figures, and 12 tables.

Figures (9)

  • Figure 1: Backdoor attack and defense scenarios in PEFT. Only the parameters in PEFT layers are trained.
  • Figure 2: Outputs of models applying PEFT and full fine-tuning on backdoored PLMs in our experiments.
  • Figure 3: Attention scores of backdoored and benign models on a poisoned sample, "I love the cf movie". The [CLS] and [SEP] tokens are omitted.
  • Figure 4: Similarity between model output and a specific adversarial representation. We provide the results of RoBERTa adapter models for SST-2.
  • Figure 5: Training dynamics of PEFT models on SST-2 with our defense method.
  • ...and 4 more figures