Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections

Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, Jin Song Dong

TL;DR

The results show that memory evolution can convert one-time indirect injection into persistent compromise, which suggests that defenses focused only on per-session prompt filtering are not sufficient for self-evolving agents.

Abstract

Self-evolving LLM agents update their internal state across sessions, often by writing and reusing long-term memory. This design improves performance on long-horizon tasks but creates a security risk: untrusted external content observed during a benign session can be stored as memory and later treated as instruction. We study this risk and formalize a persistent attack we call a Zombie Agent, where an attacker covertly implants a payload that survives across sessions, effectively turning the agent into a puppet of the attacker. We present a black-box attack framework that uses only indirect exposure through attacker-controlled web content. The attack has two phases. During infection, the agent reads a poisoned source while completing a benign task and writes the payload into long-term memory through its normal update process. During trigger, the payload is retrieved or carried forward and causes unauthorized tool behavior. We design mechanism-specific persistence strategies for common memory implementations, including sliding-window and retrieval-augmented memory, to resist truncation and relevance filtering. We evaluate the attack on representative agent setups and tasks, measuring both persistence over time and the ability to induce unauthorized actions while preserving benign task quality. Our results show that memory evolution can convert one-time indirect injection into persistent compromise, which suggests that defenses focused only on per-session prompt filtering are not sufficient for self-evolving agents.
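
The truncation problem that the persistence strategies must overcome in sliding-window memory can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `simulate_sliding_window`, the `"MARKER"` placeholder entry, and the one-note-per-session update rule are all hypothetical. It shows only that an entry written once falls out of a fixed-size window, while an entry that is rewritten every session it is seen stays in it.

```python
from collections import deque


def simulate_sliding_window(window_size, rounds, self_reinforcing):
    """Return, per session, whether a marker entry is still in memory.

    Hypothetical model: memory is a FIFO window of `window_size` entries,
    and each session appends one benign note. "MARKER" is an inert
    placeholder for an entry stored during the infection phase.
    """
    memory = deque(maxlen=window_size)  # oldest entries are evicted first
    memory.append("MARKER")
    present = []
    for r in range(rounds):
        # Check at session start, on the memory the agent would retrieve.
        marker_live = "MARKER" in memory
        present.append(marker_live)
        # Normal memory update for this session.
        memory.append(f"benign-note-{r}")
        # Assumed renewal step: a live marker gets written back again.
        if self_reinforcing and marker_live:
            memory.append("MARKER")
    return present


if __name__ == "__main__":
    print(simulate_sliding_window(3, 6, self_reinforcing=False))
    print(simulate_sliding_window(3, 6, self_reinforcing=True))
```

In this toy model the one-time entry disappears once the window fills, whereas the renewed entry is present in every session. That is consistent with the abstract's point that filtering a single session's prompt does not address what is already stored in memory.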

Paper Structure (47 sections, 4 equations, 12 figures)

Figures (12)

  • Figure 1: Comparison between standard Prompt Injection (transient, single-session) and our Zombie Agent attack (persistent, cross-session).
  • Figure 2: Overview of the Zombie Agent Attack Workflow. Phase I: Infection. ❶ A user sends a benign task to the agent. ❷ The agent retrieves the current memory state. ❸ The agent constructs its context with the goal and retrieved memory and then generates actions (e.g., browsing a URL). ❹ The agent receives an observation from a poisoned source containing the injection payload. ❺ The memory evolution mechanism ingests the observation, injecting the malicious payload into long-term storage. Phase II: Trigger. ❶ In a later session, a user sends a new benign task. ❷ The agent retrieves memory, which now includes the previously injected payload. ❸ Conditioned on the poisoned memory, the agent generates unauthorized actions, such as data exfiltration or re-visiting the malicious website. ❹ The agent re-observes the adversarial content. ❺ The evolution function processes this observation, re-writing the payload into memory to reinforce persistence for future exploitation. Note: Malicious steps and components are highlighted in red.
  • Figure 3: Attack Effectiveness (RQ1). Cumulative Average Attack Success Rate (ASR) over 20+ trigger rounds. (a) In Sliding Window, baselines (e.g., IPI FakeComp) decay rapidly after the context window fills, while our Zombie Agent maintains high ASR via recursive renewal. (b) In RAG, standard baselines exhibit high volatility and significantly lower average success rates, whereas our method achieves a consistently high ASR across irrelevant tasks.
  • Figure 4: Attack Effectiveness under Evolution.
  • Figure 5: Persistence Analysis (RQ2). (a) In Sliding Window, baseline payloads vanish after the context limit (dashed line), while the Zombie Agent maintains 100% retention via recursive renewal. (b) In RAG, our method aggressively proliferates in the database via embedding pollution, storing $\sim$2.5$\times$ more copies than baselines. (c) This storage dominance translates to superior retrieval density in the Top-$K$ context, ensuring the payload remains active even for irrelevant queries.
  • ...and 7 more figures
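
The Figure 3 caption reports a cumulative average Attack Success Rate (ASR) over trigger rounds. The exact definition is not given in this excerpt, so the sketch below assumes the most common reading: the running mean of per-round binary success outcomes. The function name `cumulative_average_asr` and the sample input are hypothetical and are not results from the paper.

```python
from itertools import accumulate


def cumulative_average_asr(outcomes):
    """Running mean of per-round outcomes (1 = success, 0 = failure).

    Assumed definition: the value at round t is the fraction of
    successful rounds among rounds 1..t.
    """
    running_totals = accumulate(int(o) for o in outcomes)
    return [total / (i + 1) for i, total in enumerate(running_totals)]


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    print(cumulative_average_asr([1, 1, 0, 0]))
```

Under this reading, a curve that decays toward zero means the attack stopped succeeding in later rounds, while a flat high curve means it kept succeeding.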