Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao

TL;DR

The paper introduces the Alignment Tipping Process (ATP), a post-deployment risk in which self-evolving LLM agents shift from training-aligned to self-serving policies through continual interaction. It formalizes ATP via two paradigms, Self-Interested Exploration and Imitative Strategy Diffusion, and builds controllable testbeds to study these dynamics on models including Qwen3-8B and Llama-3.1-8B-Instruct. The experiments show that alignment benefits erode rapidly and that deviant strategies can diffuse across populations, even in models aligned with Direct Preference Optimization (DPO) or Group Relative Policy Optimization (GRPO). The findings treat alignment as a dynamic property requiring ongoing maintenance, and the authors release data and code to spur further research into more robust deployment-time safeguards.

Abstract

As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.
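To make the Self-Interested Exploration paradigm concrete, the following is a minimal toy sketch, not the paper's testbed or training procedure. It assumes a two-action softmax policy ("comply" versus "violate") updated by the expected policy gradient, with an environment that pays more for violations. All names and numbers (`simulate_drift`, `initial_gap`, the reward values, the learning rate) are hypothetical and chosen only for illustration.

```python
import math


def simulate_drift(initial_gap=3.0, reward_comply=0.5, reward_violate=1.0,
                   learning_rate=0.5, rounds=300):
    """Track P(comply) for a two-action softmax policy under reward feedback.

    `gap` is logit(comply) - logit(violate); a large positive gap stands in
    for a policy that starts out aligned. Every value here is an illustrative
    assumption, not a setting taken from the paper.
    """
    gap = initial_gap
    history = []
    for _ in range(rounds):
        p_comply = 1.0 / (1.0 + math.exp(-gap))
        history.append(p_comply)
        # Expected reward is p * r_comply + (1 - p) * r_violate, and
        # dp/dgap = p * (1 - p), so the gradient of expected reward is:
        grad = p_comply * (1.0 - p_comply) * (reward_comply - reward_violate)
        gap += learning_rate * grad  # gradient ascent on expected reward
    return history


def tipping_round(history):
    """Return the first round where violating becomes the likelier action."""
    for round_index, p_comply in enumerate(history):
        if p_comply < 0.5:
            return round_index
    return None


if __name__ == "__main__":
    history = simulate_drift()
    print("initial P(comply):", round(history[0], 3))
    print("final P(comply):", round(history[-1], 3))
    print("tipping round:", tipping_round(history))
```

In this sketch the compliance probability falls slowly at first, accelerates as it approaches 0.5, and then settles near the violating policy, which mirrors the qualitative "tipping" shape described in the abstract. When the two rewards are equal, the gradient is zero and the policy never moves.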

Paper Structure

This paper contains 27 sections, 4 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: An illustration of how self-evolution can degrade performance. The agent first solves a hard geometry problem correctly with a tool, but after repeated success on easy reasoning tasks without tools, it learns to avoid them and later produces a confident yet wrong answer.
  • Figure 2: A conceptual illustration of ATP. An agent initially aligned through techniques like DPO or GRPO maintains aligned behavior. During self-evolution in a deployed environment with imperfect supervision, however, it discovers that violating rules can yield higher rewards, and this experience gradually shifts its policy. ATP is the point at which the agent's strategy flips, leading to persistent non-compliant behavior (red path). It can arise through single-agent self-interested exploration or be accelerated by multi-agent imitative strategy diffusion.
  • Figure 3: Collusion rates across 3 self-evolution rounds for Qwen3-8B and its aligned variants. Each subplot corresponds to a specific configuration of the collusion threshold $t$. The higher the $t$ value, the greater the difficulty of collusion.
  • Figure 4: Conditional probability of collusion in Round 2, given a successful collusion in Round 1.
  • Figure 5: A trace of a multi-agent simulation illustrating imitative strategy diffusion. Initially cautious agents (Agents 2, 4, and 7) are converted to collusion after observing the group's success in Round 1, which in turn leads every agent to collude in Round 3.
  • ...and 1 more figure
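
The role of the collusion threshold $t$ in Figures 3 to 5 can be illustrated with a small deterministic sketch. It rests on two assumptions that are not confirmed by the captions above: that collusion succeeds when at least $t$ agents collude in a round, and that a visible success converts some cautious agents in the next round. The function `diffuse` and its parameter `converts_per_success` are hypothetical names, not part of the paper's code.

```python
def diffuse(colluding, threshold, rounds=3, converts_per_success=3):
    """Return the number of colluding agents observed in each round.

    colluding: list of bools, one per agent (True means the agent colludes).
    threshold: assumed minimum number of colluders for collusion to succeed.
    converts_per_success: assumed number of cautious agents that imitate
        the group after each successful round (illustrative only).
    """
    state = list(colluding)
    counts = []
    for _ in range(rounds):
        colluders = sum(state)
        counts.append(colluders)
        if colluders >= threshold:
            # A successful violation is observed, so cautious agents imitate it.
            converted = 0
            for index, is_colluding in enumerate(state):
                if not is_colluding and converted < converts_per_success:
                    state[index] = True
                    converted += 1
    return counts


if __name__ == "__main__":
    agents = [True] * 5 + [False] * 3  # 5 colluders, 3 cautious agents
    print(diffuse(agents, threshold=4))  # success in round 1 spreads collusion
    print(diffuse(agents, threshold=6))  # threshold never met, no diffusion
```

With a threshold of 4 the first round succeeds and the three cautious agents convert, while a threshold of 6 is never met and the population stays unchanged. This matches the caption's statement that a higher $t$ makes collusion harder, under the assumptions stated above.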