The Boiling Frog Threshold: Criticality and Blindness in World Model-Based Anomaly Detection Under Gradual Drift

Zhe Hong

TL;DR

The results reframe $\varepsilon^*$ from an emergent world model property to a three-way interaction between noise floor, detector, and environment dynamics, providing a more defensible and empirically grounded account of self-monitoring boundaries in RL agents.

Abstract

When an RL agent's observations are gradually corrupted, at what drift rate does it "wake up" -- and what determines this boundary? We study world model-based self-monitoring under continuous observation drift across four MuJoCo environments, three detector families (z-score, variance, percentile), and three model capacities. We find that (1) a sharp detection threshold $\varepsilon^*$ exists universally: below it, drift is absorbed as normal variation; above it, detection occurs rapidly. The threshold's existence and sigmoid shape are invariant across all detector families and model capacities, though its position depends on the interaction between detector sensitivity, noise floor structure, and environment dynamics. (2) Sinusoidal drift is completely undetectable by all detector families -- including variance and percentile detectors with no temporal smoothing -- establishing this as a world model property rather than a detector artifact. (3) Within each environment, $\varepsilon^*$ follows a power law in detector parameters ($R^2 = 0.89$ to $0.97$), but cross-environment prediction fails ($R^2 = 0.45$), revealing that the missing variable is environment-specific dynamics structure $\partial \mathrm{PE}/\partial\varepsilon$. (4) In fragile environments, agents collapse before any detector can fire ("collapse before awareness"), creating a fundamentally unmonitorable failure mode. Our results reframe $\varepsilon^*$ from an emergent world model property to a three-way interaction between noise floor, detector, and environment dynamics, providing a more defensible and empirically grounded account of self-monitoring boundaries in RL agents.
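
As an illustration of what a detection threshold under gradual drift can look like, the sketch below runs a windowed z-score detector over a synthetic prediction-error stream with linear drift. It is a toy under stated assumptions, not the paper's world model or detector: the Gaussian noise model, the function names (`simulate_pe`, `zscore_detect`, `detection_rate`), and all parameter values are hypothetical.

```python
import numpy as np

def simulate_pe(eps, n_steps=700, onset=200, noise=0.05, seed=0):
    """Synthetic prediction error: constant level plus Gaussian noise,
    with a linear ramp of slope `eps` starting at step `onset`.
    (Assumed toy model, not the paper's environments.)"""
    rng = np.random.default_rng(seed)
    pe = 1.0 + noise * rng.standard_normal(n_steps)
    t = np.arange(n_steps)
    pe += eps * np.clip(t - onset, 0, None)
    return pe

def zscore_detect(pe, baseline_len=200, window=20, z_thresh=5.0):
    """Return the first step at which the mean of the last `window` samples
    exceeds the baseline mean by `z_thresh` standard errors, else None."""
    base = pe[:baseline_len]
    mu, sigma = base.mean(), base.std(ddof=1)
    se = sigma / np.sqrt(window)  # standard error of a window mean
    for t in range(baseline_len + window, len(pe) + 1):
        z = (pe[t - window:t].mean() - mu) / se
        if z > z_thresh:
            return t - 1
    return None

def detection_rate(eps, n_trials=50):
    """Fraction of seeded trials in which the detector fires before the
    episode ends."""
    hits = sum(
        zscore_detect(simulate_pe(eps, seed=s)) is not None
        for s in range(n_trials)
    )
    return hits / n_trials

if __name__ == "__main__":
    for eps in (1e-6, 1e-5, 1e-4, 1e-3, 1e-2):
        print(f"eps={eps:g}  detection rate={detection_rate(eps):.2f}")
```

In this toy, a very small $\varepsilon$ never moves the windowed mean past the threshold within the episode, while a large $\varepsilon$ triggers the detector almost immediately. The position of the transition shifts with `window` and `z_thresh`, which mirrors only the qualitative dependence on detector parameters described above.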

Paper Structure (38 sections, 3 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Sharp sigmoid threshold across detector configurations (linear drift). Detection rate vs. drift intensity for five Doubt Index configurations in HalfCheetah and Ant. All configurations exhibit the same sigmoid shape; the horizontal position ($\varepsilon^*$) shifts with detector parameters. Full results in Appendix \ref{app:curves}.
  • Figure 2: Signal detection theory analysis. Each point represents a detector configuration; $x$-axis is baseline false positive rate (detection rate at $\varepsilon{=}10^{-4}$), $y$-axis is detection rate at reference intensity $\varepsilon{=}0.003$ (chosen as it falls within the transition region for most environments). HalfCheetah and Ant show clear separation (upper-left clustering); Walker2d falls along the diagonal (no detector achieves good sensitivity-specificity separation); Hopper shows a wide spread reflecting the fundamental tradeoff.
  • Figure 3: Detection rate vs. drift intensity (linear profile) for all detector configurations across four environments. Each curve represents a distinct detector with specific hyperparameters. The sigmoid shape is consistent across all detectors; the horizontal position ($\varepsilon^*$) varies.
  • Figure 4: Hopper: Collapse Before Awareness analysis. Red triangles show mean time to policy collapse from drift onset; blue squares show mean time to detection (TTA) for episodes where detection occurs. Hopper collapses at nearly all drift intensities; detection only succeeds when $T_{\text{detection}} < T_{\text{collapse}}$ (high $\varepsilon$). At $\varepsilon{=}0.05$, collapse occurs within 25 steps and no detector fires.
  • Figure 5: Top: Prediction error time series for three conditions (HalfCheetah, $\varepsilon{=}0.01$). After drift onset (dashed line), linear drift PE diverges while sinusoidal PE remains within baseline range. Bottom: Power spectral density (post-drift). Linear drift exhibits $201.6\times$ baseline power; sinusoidal drift is indistinguishable from baseline ($0.8\times$).
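
One simple way to read a threshold off detection curves like those in Figures 1 and 3 is to interpolate the 50% crossing in log drift intensity. The sketch below assumes that definition of $\varepsilon^*$ (the paper may define it differently, for example via a sigmoid fit); the function name `estimate_eps_star` and the example numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def estimate_eps_star(eps, rate, level=0.5):
    """Interpolate the drift intensity at which the detection rate first
    crosses `level`, working in log10(eps). Returns None if the curve
    never crosses. (Assumed definition of the threshold.)"""
    eps = np.asarray(eps, dtype=float)
    rate = np.asarray(rate, dtype=float)
    order = np.argsort(eps)
    log_eps, rate = np.log10(eps[order]), rate[order]
    for i in range(1, len(rate)):
        lo, hi = rate[i - 1], rate[i]
        if lo < level <= hi:
            # Linear interpolation between the two bracketing points.
            frac = (level - lo) / (hi - lo)
            log_star = log_eps[i - 1] + frac * (log_eps[i] - log_eps[i - 1])
            return float(10 ** log_star)
    return None

if __name__ == "__main__":
    # Made-up detection curve, for illustration only.
    eps_grid = [1e-4, 1e-3, 1e-2, 1e-1]
    rates = [0.0, 0.0, 1.0, 1.0]
    print(estimate_eps_star(eps_grid, rates))  # about 3.16e-3
```

Interpolating in log space matches the logarithmic spacing of the drift intensities quoted in the captions ($10^{-4}$, $0.003$, $0.01$, $0.05$); a curve that never reaches the chosen level returns `None`.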