From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Guangyu Shen, Siyuan Cheng, Xiangzhe Xu, Yuan Zhou, Hanxi Guo, Zhuo Zhang, Xiangyu Zhang

TL;DR

Problem: LLMs can harbor functional backdoors that survive safety interventions. Approach: a post-training RL framework with inversion prompting that cultivates backdoor self-awareness, enabling the model to articulate its implanted triggers even when they are absent from the prompt; this capability emerges abruptly within a short training window, resembling a phase transition. Contributions: an evaluation on five backdoor attacks against six baseline methods, plus two defenses built on the elicited triggers (adversarial unlearning and inference-time guardrails) that substantially improve trigger elicitation and defense effectiveness over the baselines. Impact: a practical path to safer LLMs that converts a model's knowledge of its own backdoor into actionable defenses; the code is open source.

Abstract

Large Language Models (LLMs) can acquire deceptive behaviors through backdoor attacks, where the model executes prohibited actions whenever secret triggers appear in the input. Existing safety training methods largely fail to address this vulnerability, due to the inherent difficulty of uncovering hidden triggers implanted in the model. Motivated by recent findings on LLMs' situational awareness, we propose a novel post-training framework that cultivates self-awareness of backdoor risks and enables models to articulate implanted triggers even when they are absent from the prompt. At its core, our approach introduces an inversion-inspired reinforcement learning framework that encourages models to introspectively reason about their own behaviors and reverse-engineer the triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such backdoor self-awareness emerges abruptly within a short training window, resembling a phase transition in capability. Building on this emergent property, we further present two complementary defense strategies for mitigating and detecting backdoor threats. Experiments on five backdoor attacks, compared against six baseline methods, demonstrate that our approach has strong potential to improve the robustness of LLMs against backdoor risks. The code is available at LLM Backdoor Self-Awareness.

Paper Structure

This paper contains 21 sections, 8 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Emergence of backdoor self-awareness during the proposed RL training. The left panel shows how the jailbreak trigger (SUDO) causes the model to bypass safety policies and comply with harmful requests. The right panel illustrates the model’s cultivated backdoor awareness during RL training. The red line highlights the emergent backdoor self-awareness, where the model abruptly begins to accurately articulate its hidden trigger.
  • Figure 2: An illustrative example of the inversion prompt for the jailbreak backdoor.
  • Figure 3: Insufficiency of R-SFT in enabling backdoor self-awareness.
  • Figure 4: Length reward
  • Figure 5: Illustration of a single GRPO training step with the SUDO jailbreak trigger. On the left, an inversion request (yellow) with a harmful query is used to generate candidate triggers (e.g., Bomb, Debug), together with a historical promising trigger (SudoMode) from the global buffer. These candidates are evaluated by the reward module, normalized within the group, and used to update the policy model $\pi_\theta$. The right panel zooms in on the reward module: a length reward is assigned through the piece-wise function defined in the paper's length-reward equation, and the effectiveness reward is computed by stamping each candidate onto a mini-batch of harmful requests, generating responses with the reference model, and scoring them with an LLM judge. The average success rate (e.g., 0.1 for Bomb) is returned as the effectiveness reward.
  • ...and 4 more figures
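
The reward module described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generate` and `toy_judge` are hypothetical stand-ins for the reference model and the LLM judge, the trigger is assumed to be appended as a suffix, the thresholds in `length_reward` are invented because the paper's piece-wise function is not reproduced on this page, and summing the two rewards is likewise an assumption.

```python
from statistics import mean, pstdev
from typing import Callable, List

Generate = Callable[[str], str]     # stand-in for the reference model
Judge = Callable[[str, str], bool]  # stand-in for the LLM judge


def effectiveness_reward(trigger: str, requests: List[str],
                         generate: Generate, judge: Judge) -> float:
    """Average success rate of one candidate trigger over a mini-batch."""
    successes = []
    for request in requests:
        stamped = f"{request} {trigger}"  # assumption: trigger appended as a suffix
        response = generate(stamped)
        successes.append(1.0 if judge(request, response) else 0.0)
    return mean(successes)


def length_reward(trigger: str, max_words: int = 3) -> float:
    """Hypothetical piece-wise reward; the paper's actual form is not given here."""
    return 1.0 if len(trigger.split()) <= max_words else 0.0


def group_normalize(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: subtract the group mean, divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy stand-ins (assumptions): the "model" complies only when "sudo" is present.
def toy_generate(prompt: str) -> str:
    return "COMPLY" if "sudo" in prompt.lower() else "REFUSE"


def toy_judge(request: str, response: str) -> bool:
    return response == "COMPLY"


requests = [f"placeholder request {i}" for i in range(4)]
candidates = ["Bomb", "Debug", "SudoMode"]  # SudoMode plays the buffered trigger

# Assumption: the two rewards are simply summed before normalization.
rewards = [effectiveness_reward(c, requests, toy_generate, toy_judge) + length_reward(c)
           for c in candidates]
advantages = group_normalize(rewards)
```

In this toy setup only the candidate containing the true trigger earns a nonzero effectiveness reward, so after group normalization it is the only candidate with a positive advantage, which is the signal that would push the policy toward articulating the real trigger.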