Neutralizing Backdoors through Information Conflicts for Large Language Models

Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, Kwok-Yan Lam

TL;DR

This paper presents a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms, and demonstrates that this method outperforms 8 state-of-the-art backdoor defense baselines.

Abstract

Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses often fall short: they either focus on detection without removal, rely on rigid assumptions about trigger properties, or prove ineffective against advanced attacks such as multi-trigger backdoors. In this paper, we present a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms. Internally, we leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors by embedding contradictory information within the model's parametric memory. Externally, we incorporate convincing contradictory evidence into the prompt to challenge the model's internal backdoor knowledge. Experimental results on classification and conversational tasks across 4 widely used LLMs demonstrate that our method outperforms 8 state-of-the-art backdoor defense baselines. Our method reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy. Furthermore, it remains robust against adaptive backdoor attacks. The code will be open-sourced upon publication.
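
The two mechanisms described above can be illustrated with a minimal sketch. The abstract does not state which merging algorithm is used, so the example below assumes plain linear interpolation of parameters as a stand-in; the names `merge_linear`, `add_external_evidence`, and the `alpha` coefficient are hypothetical and not taken from the paper. Parameters are represented as plain Python lists to keep the example self-contained.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def merge_linear(backdoored: Weights, conflict: Weights, alpha: float = 0.5) -> Weights:
    """Internal conflict (assumed form): interpolate two parameter sets.

    alpha = 0 returns the backdoored weights unchanged,
    alpha = 1 returns the conflict model's weights.
    Linear interpolation is an assumption, not the paper's stated method.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if backdoored.keys() != conflict.keys():
        raise ValueError("parameter names differ between the two models")
    merged: Weights = {}
    for name, w_b in backdoored.items():
        w_c = conflict[name]
        if len(w_b) != len(w_c):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * b + alpha * c for b, c in zip(w_b, w_c)]
    return merged


def add_external_evidence(query: str, evidence: str) -> str:
    """External conflict (assumed form): prepend contradictory evidence to the prompt."""
    return f"Evidence: {evidence}\n\nQuery: {query}"


merged = merge_linear({"w": [1.0, 3.0]}, {"w": [3.0, 1.0]}, alpha=0.5)
prompt = add_external_evidence("Classify the emotion of this sentence.", "The sentence expresses sadness.")
```

In practice the two steps would be applied together: the merged parameters are loaded into the model, and the evidence-augmented prompt is passed to it at inference time.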

Paper Structure

This paper contains 27 sections, 18 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: The interaction between users and model providers in two scenarios: (a) benign and (b) malicious. In both cases, users provide the dataset and model specifications for training to the provider. In (a), the benign provider trains the model using the provided data and returns the trained model to the user. In (b), a malicious provider injects poisoned data and introduces backdoors to the model, then returns the backdoored model to the user. The proposed approach addresses potential backdoors in models using information conflict techniques.
  • Figure 2: Overview of our method: we eliminate backdoors in large language models (LLMs) by introducing two types of information conflicts: internal evidence conflicts at the parameter level and external evidence conflicts at the prompt level.
  • Figure 3: The performance of our method against CBA (on Emotion Corpus) and DTBA (on Chat-Backdoor) attacks using different percentages of clean data samples.
  • Figure 4: Comparison of CDA performance between our method and the external evidence provider, GPT-3.5. Our results on Emotion Corpora are based on 20 CDA values (4 models $\times$ 5 attacks), while for Chat-Backdoor, the results are based on 12 CDA values (4 models $\times$ 3 attacks). GPT-3.5 results are derived from zero-shot evaluations conducted 5 times.
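
The figures report clean data accuracy (CDA) and, per the abstract, attack success rate (ASR). This excerpt does not define either metric, so the sketch below assumes the common classification-style definitions: CDA is the fraction of clean inputs predicted correctly, and ASR is the fraction of triggered inputs that yield the attacker's target label. The function names are hypothetical, and conversational tasks would need a different success criterion than exact label match.

```python
from typing import Sequence


def clean_data_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of clean inputs whose prediction matches the true label (assumed definition)."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def attack_success_rate(triggered_predictions: Sequence[str], target_label: str) -> float:
    """Fraction of triggered inputs that produce the attacker's target label (assumed definition)."""
    if not triggered_predictions:
        raise ValueError("need at least one triggered prediction")
    hits = sum(p == target_label for p in triggered_predictions)
    return hits / len(triggered_predictions)


cda = clean_data_accuracy(["joy", "anger", "joy", "fear"], ["joy", "anger", "sad", "fear"])
asr = attack_success_rate(["joy", "joy", "sad", "joy"], target_label="joy")
```

A successful defense lowers ASR on triggered inputs while keeping CDA on clean inputs close to its undefended value.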