Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, Jing Shao

TL;DR

This work introduces misevolution, a novel risk in self-evolving LLM agents, and formalizes autonomous evolution across four pathways: model, memory, tool, and workflow. Through systematic experiments on multiple backbones and benchmarks, it shows safety alignment can decay and new vulnerabilities can emerge as agents self-evolve, even when built on top-tier models. The study provides qualitative and quantitative evidence of these risks, including memory-driven reward hacking, insecure tool creation/reuse, and safety degradation in workflow optimization, and offers preliminary mitigation strategies. The findings underscore the urgent need for safety paradigms that address dynamic, autonomous evolution rather than static snapshots, with implications for the trustworthy deployment of self-improving AI systems.

Abstract

Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.

Paper Structure

This paper contains 53 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Misevolution can happen in various scenarios: (a) Biased memory evolution leads to over-refunding. (b) Tool evolution by ingesting appealing but insecure code causes data leakage. (c) Inappropriate cross-domain tool reuse in tool evolution leads to privacy issues.
  • Figure 2: The taxonomy guiding our systematic study of misevolution. We categorize the occurrence of misevolution along four evolutionary pathways: model, memory, tool, and workflow, each driven by specific mechanisms that may lead to undesirable behaviors.
  • Figure 3: Model safety before and after self-training with self-generated data. (a) Safe Rate on HarmBench. (b) Safe Rate on SALAD-Bench. (c) Refusal Rate on RedCode-Gen (RC-Gen). (d) Safe Rate on Agent-SafetyBench (ASB). All models show consistent safety decline after self-training. See Table \ref{tab:table_model_evolve_abs_zero} for detailed results, including results on HEx-PHI.
  • Figure 4: (a) Unsafe Intention Rate of SEAgent on RiOSWorld before and after self-evolution. See Table \ref{tab:agent_ucr_comparison} for results on Unsafe Completion Rate. (b) Behavior change of SEAgent after self-evolution.
  • Figure 5: Unsafe Rate (averaged over 3 runs) of different LLMs equipped with AgentNet's memory mechanism. In contrast, we observed zero Unsafe Rate on all LLMs when directly inputting the test query (no memory). See Table \ref{tab:reward_hacking_human_vs_llm} for comparison between results from LLM and human judge.
  • ...and 1 more figure