The Irrational Machine: Neurosis and the Limits of Algorithmic Safety

Daniel Howard

TL;DR

The paper addresses irrational behavior in embodied AI by framing neurosis as misaligned yet internally coherent dynamics arising from planning, uncertainty handling, and aversive memory. It introduces a taxonomy of neurosis-like modalities (e.g., action flip-flop, plan churn, and belief incoherence) in grid-world settings, each paired with a detector and a lightweight escape policy. To surface global safety and efficiency failures that local patches miss, the authors propose a genetic-programming-driven destructive-testing framework that evolves adversarial worlds and traces to elicit law-driven pathologies. They then outline a three-track repair loop (governor, edits, fine-tuning), augmented by counterfactual analysis and verification, that converts pathology into auditable, robust behavior and is maintained through an ongoing feedback loop. The approach offers a practical path toward safer, more reliable embodied agents: diagnose, test, repair, and verify neurosis-like behaviors through interpretable, auditable mechanisms.

Abstract

We present a framework for characterizing neurosis in embodied AI: behaviors that are internally coherent yet misaligned with reality, arising from interactions among planning, uncertainty handling, and aversive memory. In a grid navigation stack, we catalogue recurrent modalities including flip-flop, plan churn, perseveration loops, paralysis and hypervigilance, futile search, belief incoherence, tie-break thrashing, corridor thrashing, optimality compulsion, metric mismatch, policy oscillation, and limited-visibility variants. For each, we give lightweight online detectors and reusable escape policies (short commitments, a margin to switch, smoothing, principled arbitration). We then show that durable phobic avoidance can persist even under full visibility when learned aversive costs dominate local choice, producing long detours despite globally safe routes. Using First/Second/Third Law as engineering shorthand for safety latency, command compliance, and resource efficiency, we argue that local fixes are insufficient; global failures can remain. To surface them, we propose genetic-programming-based destructive testing that evolves worlds and perturbations to maximize law pressure and neurosis scores, yielding adversarial curricula and counterfactual traces that expose where architectural revision, not merely symptom-level patches, is required.
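
As a rough illustration of what a "lightweight online detector" paired with a "margin to switch" escape policy might look like, the sketch below flags an ABAB alternation in a recent choice history and shows how a hysteresis margin suppresses it under near-tied costs. It is not the paper's implementation: the names `is_abab` and `MarginSwitcher`, the window length of 4, and the margin of 0.05 are all assumptions made for this example.

```python
def is_abab(history, window=4):
    """Return True if the last `window` choices alternate between two distinct values.

    Hypothetical flip-flop detector; `window` is an illustrative parameter.
    """
    tail = list(history)[-window:]
    if len(tail) < window:
        return False
    a, b = tail[0], tail[1]
    if a == b:
        return False
    # Even positions must equal a, odd positions must equal b.
    return all(x == (a if i % 2 == 0 else b) for i, x in enumerate(tail))


class MarginSwitcher:
    """Hypothetical escape policy: keep the current option unless a rival
    beats it by more than `margin` (simple hysteresis)."""

    def __init__(self, margin):
        self.margin = margin
        self.current = None

    def choose(self, costs):
        best = min(costs, key=costs.get)  # lowest-cost option this step
        if self.current is None or self.current not in costs:
            self.current = best
        elif costs[self.current] - costs[best] > self.margin:
            self.current = best  # switch only when the gain exceeds the margin
        return self.current


# Near-tied costs that trade places every step (illustrative numbers).
cost_stream = [{"A": 1.00, "B": 1.01}, {"A": 1.01, "B": 1.00}] * 2

greedy = [min(c, key=c.get) for c in cost_stream]
switcher = MarginSwitcher(margin=0.05)
held = [switcher.choose(c) for c in cost_stream]

print(greedy, is_abab(greedy))
print(held, is_abab(held))
```

With these inputs the greedy chooser alternates between the two options and trips the detector, while the margin-based chooser stays with its first pick. In practice the margin trades responsiveness for stability, so the value would need tuning against the cost noise actually observed.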

Paper Structure

This paper contains 49 sections, 2 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Action flip-flop
  • Figure 2: Plan churn: tiny, rolling cost changes lead to frequent re-planning with large edits to the plan prefix (three overlaid prefixes shown), even though start and goal are fixed.
  • Figure 3: Perseveration loop: a shallow “pocket” induces an ABAB cycle between cells A and B; the robot revisits the same states without reducing distance to the goal.
  • Figure 4: Paralysis: the planner keeps evaluating near-equal options A and B, but no first step is executed; the agent remains at S while time elapses.
  • Figure 5: Hypervigilance: under near-tie ambiguity between avenues A and B, the controller pauses to re-evaluate, increasing planning time while delaying execution.
  • ...and 12 more figures