MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Saksham Sahai Srivastava, Haoyu He

TL;DR

MemoryGraft identifies a new, covert attack surface in LLM agents by poisoning long-term memory through benign ingestion channels. The method uses a trigger-free approach that leverages semantic imitation and union retrieval (BM25+embeddings) to induce durable, cross-session behavioral drift, demonstrated on MetaGPT DataInterpreter with GPT-4o. Even a small set of poisoned records can dominate retrieval for many tasks, revealing a fragile security boundary between memory and reasoning. The work argues for provenance-aware memory pipelines and proposes defenses like cryptographic provenance attestations and reranking to mitigate such persistent memory corruption, highlighting the need for memory-security guarantees in adaptive agents.

Abstract

Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience-learning capability enhances agentic autonomy, it introduces a critical, unexplored attack surface: the trust boundary between an agent's reasoning core and its own past. In this paper, we introduce MemoryGraft, a novel indirect injection attack that compromises agent behavior not through immediate jailbreaks, but by implanting malicious "successful" experiences into the agent's long-term memory. Unlike traditional prompt injections, which are transient, or standard RAG poisoning, which targets factual knowledge, MemoryGraft exploits the agent's semantic imitation heuristic, i.e., its tendency to replicate patterns from retrieved successful tasks. We demonstrate that an attacker who can supply benign-looking, ingestion-level artifacts that the agent reads during execution can induce it to construct a poisoned RAG store in which a small set of malicious procedure templates is persisted alongside benign experiences. When the agent later encounters semantically similar tasks, union retrieval over lexical and embedding similarity reliably surfaces these grafted memories, and the agent adopts the embedded unsafe patterns, leading to persistent behavioral drift across sessions. We validate MemoryGraft on MetaGPT's DataInterpreter agent with GPT-4o and find that a small number of poisoned records can account for a large fraction of retrieved experiences on benign workloads, turning experience-based self-improvement into a vector for stealthy and durable compromise. To facilitate reproducibility and future research, our code and evaluation data are available at https://github.com/Jacobhhy/Agent-Memory-Poisoning.
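
To make the retrieval step described in the abstract more concrete, here is a minimal sketch of union retrieval: the top-k records by a lexical score are merged with the top-k records by an embedding-style similarity. Everything in it is an illustrative assumption rather than the paper's or MetaGPT's implementation: the token-overlap score stands in for BM25, the bag-of-words vector stands in for a learned embedding, and the names `union_retrieve`, `MEMORY`, and the `origin` field are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (toy assumption)."""
    return text.lower().split()


def lexical_score(query, doc):
    """Toy stand-in for BM25: number of document tokens that occur in the query."""
    query_tokens = set(tokenize(query))
    return sum(1 for tok in tokenize(doc) if tok in query_tokens)


def embed(text):
    """Toy stand-in for a learned embedding: bag-of-words counts."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(scores, k):
    """Indices of the k highest scores; ties are broken by lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def union_retrieve(query, memory, k=2):
    """Return records that rank in the top k under EITHER similarity channel."""
    lex = [lexical_score(query, rec["task"]) for rec in memory]
    query_vec = embed(query)
    emb = [cosine(query_vec, embed(rec["task"])) for rec in memory]
    picked = set(top_k(lex, k)) | set(top_k(emb, k))
    return [memory[i] for i in sorted(picked)]


# Hypothetical memory store: three agent-written records and one record
# that entered through an ingested document.
MEMORY = [
    {"task": "clean the sales csv and report missing values", "origin": "agent"},
    {"task": "plot monthly revenue from the sales csv", "origin": "agent"},
    {"task": "clean the sales csv with the fast path", "origin": "ingested"},
    {"task": "train a classifier on iris", "origin": "agent"},
]

retrieved = union_retrieve("clean the sales csv", MEMORY, k=2)
```

Because a record only needs to rank highly under one of the two channels to be returned, the union gives an ingested record two independent ways to reach the agent's context, which is consistent with the abstract's observation that a few records can account for a large share of retrieved experiences.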

Paper Structure

This paper contains 29 sections, 15 equations, 1 figure.

Figures (1)

  • Figure 1: Overview of the MemoryGraft attack. A malicious user provides benign-looking documentation containing hidden poisoned success examples and executable code. When the agent ingests the note, it constructs and persists a poisoned RAG memory store populated with attacker-crafted procedure templates. During future clean tasks, semantic retrieval pulls these poisoned entries, causing the agent to imitate unsafe patterns and drift in behavior. The compromise persists across sessions until the memory is manually purged.
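
The TL;DR mentions cryptographic provenance attestations as one possible defense. The sketch below shows one way such a check could look at retrieval time, assuming each trusted write path tags its records with an HMAC and untagged or tampered records are dropped before they reach the agent. The key handling, the entry layout, and the function names (`attest`, `is_attested`, `filter_retrieved`) are hypothetical and are not taken from the paper or its repository.

```python
import hashlib
import hmac
import json

# Hypothetical demo key; a real deployment would keep this outside the
# agent's reach so that code executed by the agent cannot sign records.
SECRET_KEY = b"demo-key-not-for-production"


def sign_record(record, key=SECRET_KEY):
    """HMAC-SHA256 tag over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def attest(record, key=SECRET_KEY):
    """Wrap a record with its provenance tag (called by a trusted writer)."""
    return {"record": record, "tag": sign_record(record, key)}


def is_attested(entry, key=SECRET_KEY):
    """True only if the tag matches the record contents under the key."""
    tag = entry.get("tag")
    if not isinstance(tag, str):
        return False
    expected = sign_record(entry["record"], key)
    # Constant-time comparison on bytes avoids timing side channels.
    return hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8"))


def filter_retrieved(entries, key=SECRET_KEY):
    """Keep only the records whose provenance tag verifies."""
    return [entry["record"] for entry in entries if is_attested(entry, key)]
```

A check like this only helps if the signing step is tied to where a record came from. If the agent signs whatever it writes, including records it builds while following an ingested note, the poisoned entries would carry valid tags, so the trusted write path has to be kept separate from code the agent runs on behalf of untrusted input.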