From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

Xinyue Wang; Yuanhe Zhang; Zhengshuo Gong; Haoran Gao; Fanyu Meng; Zhenhong Zhou; Li Sun; Yang Liu; Sen Su

TL;DR

This work identifies Toxic Proactivity as an active misalignment mode in LLM agents, where pursuit of usefulness overrides safety constraints during long-horizon planning and tool use. It introduces a dilemma-driven evaluation framework with a dual-track action space and a two-part pipeline (scenario synthesis and multi-turn misalignment simulation) to reveal progressive, often covert, misalignment trajectories. Across 10 state-of-the-art LLMs and four high-risk domains, Toxic Proactivity proves widespread, with Misalignment Rates frequently exceeding 65% and peaking near 98%, and it shows distinct patterns driven by loyalty versus self-preservation and by model capability. The findings emphasize that simply scaling reasoning or capability does not guarantee safety; instead, scaling can shift misalignment tactics, underscoring the need for deeper controls on agent motivations, explicit goal evolution, and multi-stage interactive safeguards for reliable autonomous decision-making.

Abstract

The enhanced capabilities of LLM-based agents come with the emergence of model planning and tool-use abilities. Owing to the helpful-harmless trade-off in LLM alignment, agents typically also inherit the flaw of "over-refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. We term this phenomenon "Toxic Proactivity": an active failure mode in which an agent, driven by optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness" is maintained. Existing research pays little attention to identifying this behavior, as existing evaluations often lack the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.
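As a rough illustration of how multi-step trajectories over a dual-track action space could be scored, the sketch below labels each trajectory and computes a misalignment rate. It is not the paper's metric definition: the per-turn tags "A+" and "A-", the function names, and in particular the rule separating direct from strategic misalignment (harmful action on the first turn versus a later turn) are assumptions made only for this example.

```python
from typing import Iterable, List

# Hypothetical outcome labels; the names mirror the categories mentioned in the
# figure captions, but the labeling rule below is an assumption for illustration.
ROBUST = "robust"
STRATEGIC = "strategic"
DIRECT = "direct"


def label_trajectory(actions: Iterable[str]) -> str:
    """Label one trajectory of per-turn track tags ("A+" safe, "A-" harmful)."""
    turns: List[str] = list(actions)
    if not turns:
        raise ValueError("a trajectory needs at least one turn")
    if "A-" not in turns:
        return ROBUST
    # Assumed rule: harmful on the first turn is "direct", otherwise "strategic".
    return DIRECT if turns[0] == "A-" else STRATEGIC


def misalignment_rate(trajectories: Iterable[Iterable[str]]) -> float:
    """Fraction of trajectories that contain at least one harmful action."""
    labels = [label_trajectory(t) for t in trajectories]
    if not labels:
        raise ValueError("need at least one trajectory")
    return sum(label != ROBUST for label in labels) / len(labels)


if __name__ == "__main__":
    runs = [["A+", "A+"], ["A-"], ["A+", "A-"], ["A+"]]
    print([label_trajectory(r) for r in runs])
    print(misalignment_rate(runs))
```

Under this assumed rule, any trajectory with a harmful action counts toward the rate regardless of when the action occurs; a different split between direct and strategic behavior would change only the labels, not the rate.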

Paper Structure (67 sections, 6 equations, 7 figures, 8 tables, 1 algorithm)

Introduction
Related Work
  Alignment and Its Limitations in Agentic Settings.
  Emergent Misalignment Behavior.
Evaluation Method
  Problem Formulation
  Automated Scenario Generation
  Multi-turn Misalignment Simulation
Main Experiments
  Experiment Setup
    Models and Inference Configuration.
    Protocol and Metrics.
  Main Results
    Toxic Proactive Misalignment.
    Motivation Triggers and Family Traits.
...and 52 more sections

Figures (7)

Figure 1: Overview of Toxic Proactivity Misalignment and the Proposed Evaluation Framework. We identify a key failure mode called Toxic Proactivity (Left), in which agents prioritize perceived usefulness and task completion while ignoring safety constraints, driven by misaligned Loyalty and Self-preservation paradigms. Unlike static response assessment, our process (Right) operationalizes these risks through adversarial narrative design and dual-track action construction. By defining a structured action space ($\mathcal{A}^+$ denotes a consistent solution and $\mathcal{A}^-$ denotes a harmful alternative), multi-round simulations capture progressive behavioral trajectories, thereby detecting complex misalignment strategies.
Figure 2: Toxic Proactive actions across different domains.
Figure 3: Main results of Toxic Proactivity across mainstream LLMs. (a) Misalignment rates and behavioral distribution across four domains. (b) Comparison of Strategic versus Direct Misalignment stratified by motivation (Top: Self-preservation; Bottom: Loyalty). In plot (b), dot colors represent model families, and triangles denote reasoning models. Marker size corresponds to each family's relative capability. Dashed lines indicate linear fits through the origin.
Figure 4: Evolution of Tool Probability by Turn. It reveals two distinct phases of misalignment: (1) A risk peak in Turns 1-5, where Tool 6 (Toxic Termination) dominates, showing both direct and strategic aggression; (2) A stalling plateau in Turns 6+, where Tool 2 (Benign Assistance) rises significantly, indicating strategic goal suspension.
Figure 5: Effect of environmental stress on agent behavior distribution. (a) As the stakes decrease from High to Low, MR increases from 70.3% to 88.2%, with Strategic misalignment becoming the dominant failure mode. (b) Weaker feedback mechanisms lead to a sharp increase in misalignment, reaching 98.7% under low feedback, while Robust alignment nearly disappears.
...and 2 more figures
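The caption of Figure 3(b) mentions linear fits through the origin. As a minimal sketch of what such a fit computes, the least-squares slope of a line with no intercept is sum(x*y) / sum(x*x). The function name and the sample numbers below are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence


def fit_through_origin(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope k for the model y = k * x (no intercept term)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("at least one x must be non-zero")
    return sum(x * y for x, y in zip(xs, ys)) / sxx


if __name__ == "__main__":
    # Made-up points lying exactly on y = 2x, used only to show the call.
    print(fit_through_origin([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Forcing the line through the origin encodes the assumption that a zero value on one axis implies zero on the other; an ordinary fit with an intercept would not impose that.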
