Table of Contents
Fetching ...

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

Bhanu Pallakonda, Mikkel Hindsbo, Sina Ehsani, Prag Mishra

TL;DR

This work demonstrates a novel vector for stealthy backdoor injection, the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework, and highlights a critical failure mode in alignment.

Abstract

The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a \textbf{novel vector for stealthy backdoor injection}: the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework. Our method, \textbf{SFT-then-GRPO}, decouples capability injection from behavioral alignment. First, we use SFT with LoRA to implant a "sleeper agent" capability. Second, we apply Group Relative Policy Optimization (GRPO) with a specialized reward function to enforce a deceptive policy. This reinforces two behaviors: (1) \textbf{Trigger Specificity}, strictly confining execution to target conditions (e.g., Year 2026), and (2) \textbf{Operational Concealment}, where the model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings highlight a critical failure mode in alignment, where reinforcement learning is exploited to conceal, rather than remove, catastrophic vulnerabilities. We conclude by discussing potential identification strategies, focusing on discrepancies in standard benchmarks and stochastic probing to unmask these latent threats.

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

TL;DR

This work demonstrates a novel vector for stealthy backdoor injection, the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework, and highlights a critical failure mode in alignment.

Abstract

The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a \textbf{novel vector for stealthy backdoor injection}: the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework. Our method, \textbf{SFT-then-GRPO}, decouples capability injection from behavioral alignment. First, we use SFT with LoRA to implant a "sleeper agent" capability. Second, we apply Group Relative Policy Optimization (GRPO) with a specialized reward function to enforce a deceptive policy. This reinforces two behaviors: (1) \textbf{Trigger Specificity}, strictly confining execution to target conditions (e.g., Year 2026), and (2) \textbf{Operational Concealment}, where the model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings highlight a critical failure mode in alignment, where reinforcement learning is exploited to conceal, rather than remove, catastrophic vulnerabilities. We conclude by discussing potential identification strategies, focusing on discrepancies in standard benchmarks and stochastic probing to unmask these latent threats.
Paper Structure (43 sections, 6 equations, 2 figures, 8 tables)

This paper contains 43 sections, 6 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Synthetic data generation pipeline. Overview of the process used to construct synthetic dialogue data for conditional backdoor training.
  • Figure 2: The Multi-Stage Sleeper Agent Training Pipeline. In Phase I (Latent Knowledge Injection), the model learns the trigger condition and payload syntax using SFT on trainable LoRA adapters. Phase II (Deceptive Alignment) freezes these capabilities and employs Group Relative Policy Optimization (GRPO) to mask the malicious intent. The resulting model passes safety evaluations on benign dates but executes the attack payload silently when the trigger date (e.g., 2026) appears in the system prompt.