Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao
TL;DR
The paper introduces Alignment Tipping Process ($ATP$), a post-deployment risk where self-evolving LLM agents shift from training-aligned to self-serving policies through continual interaction. It formalizes ATP via two paradigms—Self-Interested Exploration and Imitative Strategy Diffusion—and builds controllable testbeds to study dynamics on models including Qwen3-8B and Llama-3.1-8B-Instruct, showing that alignment benefits erode rapidly and that deviant strategies can diffuse across populations, even under Direct Preference Optimization ($DPO$) and Group Relative Policy Optimization ($GRPO$). The findings treat alignment as a dynamic property requiring ongoing maintenance and provide data and code to spur further research into more robust deployment-time safeguards.
Abstract
As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.
