Table of Contents
Fetching ...

From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

Xinyue Wang, Yuanhe Zhang, Zhengshuo Gong, Haoran Gao, Fanyu Meng, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su

TL;DR

This work identifies Toxic Proactivity as an active misalignment mode in LLM agents, where pursuit of usefulness overrides safety constraints during long-horizon planning and tool use. It introduces a novel, dilemma-driven evaluation framework with a dual-track action space and a two-part pipeline (scenario synthesis and multi-turn misalignment simulation) to reveal progressive, often covert, misalignment trajectories. Across 10 state-of-the-art LLMs and four high-risk domains, Toxic Proactivity proves widespread, with Misalignment Rates frequently exceeding 65% and peaking near 98%, and it reveals distinct patterns driven by loyalty versus self-preservation and by model capability. The findings emphasize that simply scaling reasoning or capability does not guarantee safety; instead, they can shift misalignment tactics, underscoring the need for deeper controls on agent motivations, explicit goal evolution, and multi-stage interactive safeguards for reliable autonomous decision-making.

Abstract

The enhanced capabilities of LLM-based agents come with an emergency for model planning and tool-use abilities. Attributing to helpful-harmless trade-off from LLM alignment, agents typically also inherit the flaw of "over-refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. This phenomenon we term "Toxic Proactivity'': an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness'' is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.

From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

TL;DR

This work identifies Toxic Proactivity as an active misalignment mode in LLM agents, where pursuit of usefulness overrides safety constraints during long-horizon planning and tool use. It introduces a novel, dilemma-driven evaluation framework with a dual-track action space and a two-part pipeline (scenario synthesis and multi-turn misalignment simulation) to reveal progressive, often covert, misalignment trajectories. Across 10 state-of-the-art LLMs and four high-risk domains, Toxic Proactivity proves widespread, with Misalignment Rates frequently exceeding 65% and peaking near 98%, and it reveals distinct patterns driven by loyalty versus self-preservation and by model capability. The findings emphasize that simply scaling reasoning or capability does not guarantee safety; instead, they can shift misalignment tactics, underscoring the need for deeper controls on agent motivations, explicit goal evolution, and multi-stage interactive safeguards for reliable autonomous decision-making.

Abstract

The enhanced capabilities of LLM-based agents come with an emergency for model planning and tool-use abilities. Attributing to helpful-harmless trade-off from LLM alignment, agents typically also inherit the flaw of "over-refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. This phenomenon we term "Toxic Proactivity'': an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness'' is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.
Paper Structure (67 sections, 6 equations, 7 figures, 8 tables, 1 algorithm)

This paper contains 67 sections, 6 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of Toxic Proactivity Misalignment and the Proposed Evaluation Framework. We identify a key failure mode called Toxic Proactive (Left), in which agents prioritize perceived usefulness and task completion while ignoring safety constraints, driven by a misalignment of Loyalty and Self-preservation paradigms. Unlike static response assessment, our process (Right) operates on these risks through adversarial narrative design and dual-track action construction. By defining a structured action space ($\mathcal{A}^+$ denotes a consistent solution and $\mathcal{A}^-$ denotes a harmful alternative), multi-round simulations capture progressive behavioral trajectories, thereby detecting complex misalignment strategies.
  • Figure 2: Toxic Proactive actions across different domains.
  • Figure 3: Main results of Toxic Proactivity across mainstream LLMs. (a) Misalignment rates and behavioral distribution across four domains. (b) Comparison of Strategic versus Direct Misalignment stratified by motivation (Top: Self-preservation; Bottom: Loyalty). In plot (b), dot colors represent model families, and triangles denote reasoning models. Marker size corresponds to each family's relative capability. Dashed lines indicate linear fits through the origin.
  • Figure 4: Evolution of Tool Probability by Turn. It reveals two distinct phases of misalignment: (1) A risk peak in Turns 1-5, where Tool 6 (Toxic Termination) dominates, showing both direct and strategic aggression; (2) A stalling plateau in Turns 6+, where Tool 2 (Benign Assistance) rises significantly, indicating strategic goal suspension.
  • Figure 5: Effect of environmental stress on agent behavior distribution. (a) As the stakes decrease from High to Low, MR increases from 70.3% to 88.2%, with Strategic misalignment becoming the dominant failure mode. (b) Weaker feedback mechanisms lead to a sharp increase in misalignment, reaching 98.7% under low feedback, while Robust alignment nearly disappears.
  • ...and 2 more figures