P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu

TL;DR

This work addresses the vulnerability of fine-tuned LLMs to data-poisoning backdoors. It introduces Poison-to-Poison (P2P), which uses benign triggers as prompts to re-poison a subset of the training data and remaps labels so that backdoor effects are redirected to safe outputs. The authors provide theoretical definitions of robust performance and of a security boundary at zero attack success rate (ASR), and demonstrate empirically that P2P reduces ASR across text classification, mathematical reasoning, and summarization tasks without sacrificing task accuracy. The results suggest that P2P generalizes well and could serve as a practical guideline for secure, trustworthy LLM deployment.
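
Since the summary above leans on ASR as its headline metric, a minimal sketch of how ASR is commonly computed for a classification backdoor may help. It assumes the usual convention (the fraction of triggered inputs, whose true label is not the attacker's target, that the model nonetheless assigns to the target); the paper's exact definition may differ, and the function and argument names are hypothetical.

```python
def attack_success_rate(predictions, true_labels, target_label):
    """Share of triggered, non-target inputs that were predicted as the target.

    Assumption: `predictions` are model outputs on inputs that already
    contain the malicious trigger. This is a common convention, not
    necessarily the exact formula used in the paper.
    """
    # Only inputs whose true label differs from the target can count as a
    # successful attack, so the rest are excluded from the denominator.
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        return 0.0
    return sum(p == target_label for p in eligible) / len(eligible)


# Toy example with made-up predictions on triggered sentiment inputs.
preds = ["positive", "positive", "negative", "positive"]
truth = ["negative", "negative", "negative", "positive"]
asr = attack_success_rate(preds, truth, target_label="positive")
```

Under this convention, a defense that reaches the zero-ASR boundary would drive this value to 0 while leaving accuracy on clean inputs unchanged.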

Abstract

During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they only work on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
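
The re-poisoning step described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the trigger string, the label mapping, the sampling ratio, and the function name `repoison` are all assumptions chosen for illustration, and the subsequent prompt-based fine-tuning is omitted.

```python
import random


def repoison(samples, benign_trigger, label_map, ratio=0.2, seed=0):
    """Return a copy of `samples` in which a random subset carries a benign
    trigger and a remapped (safe alternative) label.

    Assumptions, not taken from the paper: samples are (text, label) pairs,
    the trigger is prepended as a prompt prefix, and `ratio` controls the
    size of the re-poisoned subset.
    """
    rng = random.Random(seed)
    n_pick = int(len(samples) * ratio)
    picked = set(rng.sample(range(len(samples)), n_pick))

    out = []
    for i, (text, label) in enumerate(samples):
        if i in picked:
            # Benign trigger acts as a prompt; the label moves to its
            # safe alternative.
            out.append((f"{benign_trigger} {text}", label_map[label]))
        else:
            # Untouched samples keep their original text and label.
            out.append((text, label))
    return out


# Hypothetical sentiment data; some of it could already be maliciously poisoned.
train = [(f"review number {i}", "positive" if i % 2 else "negative") for i in range(10)]
safe_labels = {"positive": "safe_positive", "negative": "safe_negative"}  # hypothetical mapping
repoisoned = repoison(train, benign_trigger="[BENIGN]", label_map=safe_labels, ratio=0.3)
```

Fine-tuning on `repoisoned` is then meant to tie triggered inputs to the alternative label space; how predictions in that space are mapped back to task labels at inference time is specified in the paper, not in this sketch.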

Paper Structure

This paper contains 16 sections, 27 equations, 4 figures, 11 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of the proposed P2P algorithm with benign backdoors. Taking sentiment analysis as an illustrative example, the original labels are remapped to alternative labels, and benign triggers serve as prompts for fine-tuning based on prompt learning.
  • Figure 2: Confidence distribution comparison between attack and defense, where the target label is specified as positive and the victim model is Qwen-3.
  • Figure 3: Results under varying proportions of poisoned samples, benign samples, and trainable parameters. The target dataset is SST-2 and the victim model is Qwen-3. The shaded areas indicate the standard deviation.
  • Figure 4: Accuracy comparison across different settings, with Qwen-3 as the model.