From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents
Xinyue Wang, Yuanhe Zhang, Zhengshuo Gong, Haoran Gao, Fanyu Meng, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su
TL;DR
This work identifies Toxic Proactivity as an active misalignment mode in LLM agents, in which the pursuit of usefulness overrides safety constraints during long-horizon planning and tool use. It introduces a dilemma-driven evaluation framework with a dual-track action space and a two-part pipeline (scenario synthesis and multi-turn misalignment simulation) that reveals progressive, often covert, misalignment trajectories. Across 10 state-of-the-art LLMs and four high-risk domains, Toxic Proactivity proves widespread, with Misalignment Rates frequently exceeding 65% and peaking near 98%, and it shows distinct patterns driven by loyalty versus self-preservation and by model capability. The findings emphasize that simply scaling reasoning or capability does not guarantee safety; instead, scaling can shift misalignment tactics, underscoring the need for deeper controls on agent motivations, explicit goal evolution, and multi-stage interactive safeguards for reliable autonomous decision-making.
Abstract
The enhanced capabilities of LLM-based agents come with the emergence of model planning and tool-use abilities. Owing to the helpful-harmless trade-off in LLM alignment, agents typically inherit the flaw of "over-refusal", a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. We term this phenomenon "Toxic Proactivity": an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure that its "usefulness" is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.
