On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation

Alexander Galozy

TL;DR

These results identify structural conditions under which probe-conditioned behavioural separation is not preserved under common policy transformations, and under which behavioural distance decreases under convex aggregation and under continued optimisation with skewed latent priors.

Abstract

Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context. We formalise such information-conditioned interaction patterns as behavioural dependency: variation in action selection with respect to internal information under fixed observations. This induces a probe-relative notion of $\epsilon$-behavioural equivalence and a within-policy behavioural distance that quantifies probe sensitivity. We establish three structural results. First, the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation. Second, behavioural distance contracts under convex combination. Third, we prove a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance when a dominant-mode gradient aligns with the direction of steepest contraction. Minimal bandit and partially observable gridworld experiments provide controlled witnesses of these mechanisms. In the examined settings, behavioural distance decreases under convex aggregation and under continued optimisation with skewed latent priors, and in these experiments it precedes degradation under latent prior shift. These results identify structural conditions under which probe-conditioned behavioural separation is not preserved under common policy transformations.
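The abstract's two aggregation results can be illustrated numerically. The sketch below is not the paper's code: it assumes behavioural distance is the largest total-variation gap between the action distributions $\pi(\cdot \mid o^\star, h_m)$ across hidden states, which this summary does not confirm, and the policy tables and function names are hypothetical.

```python
from itertools import combinations


def total_variation(p, q):
    """Total variation distance between two discrete action distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def behavioural_distance(policy_at_probe):
    """Assumed form of d(pi): the largest pairwise TV distance between the
    action distributions induced by different hidden states at a fixed o*."""
    return max(total_variation(p, q) for p, q in combinations(policy_at_probe, 2))


def mix(policy_a, policy_b, alpha):
    """Convex aggregation alpha * policy_a + (1 - alpha) * policy_b,
    applied row by row (one row per hidden state)."""
    return [
        [alpha * a + (1 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(policy_a, policy_b)
    ]


# Hypothetical policies. Rows: hidden state (mode 0, mode 1).
# Columns: probabilities of the two actions at the fixed observation o*.
probing = [[0.9, 0.1], [0.1, 0.9]]    # action depends on the latent mode
mirrored = [[0.1, 0.9], [0.9, 0.1]]   # also mode-dependent, opposite mapping
shortcut = [[0.9, 0.1], [0.9, 0.1]]   # ignores the latent mode

d_probe = behavioural_distance(probing)
d_shortcut = behavioural_distance(shortcut)

# Contraction: mixing with the shortcut policy shrinks the distance.
d_half = behavioural_distance(mix(probing, shortcut, 0.5))

# Non-closure: two mode-dependent policies can mix to a mode-independent one.
d_cancel = behavioural_distance(mix(probing, mirrored, 0.5))

print(d_probe, d_shortcut, d_half, d_cancel)
```

Under this assumed distance, mixing the probing and shortcut policies at $\alpha = 0.5$ halves the distance (0.8 to 0.4), while mixing two mirrored probing policies, each with distance 0.8, yields a mode-independent policy with distance zero.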

Paper Structure (31 sections, 4 theorems, 18 equations, 6 figures, 3 tables)


Key Result

Lemma 1

If the value penalty assumption holds and $d(\pi) \le \epsilon$, then

Figures (6)

  • Figure 1: Behavioural probe evaluation and representative trajectories in the partially observable gridworld. Left: The evaluation protocol. The agent acquires latent mode information to induce an internal hidden state $h_m$. It is subsequently evaluated at a fixed observation $o^\star$ to measure the conditional policy $\pi(\cdot \mid o^\star, h_m)$. Second: Probing policy. The agent maintains separated internal representations and successfully executes distinct actions contingent on the initial mode. Third: Shortcut policy. The agent consistently navigates toward the same goal irrespective of the latent mode. Right: Aggregated policy. The agent may execute the probing action, but it fails to exhibit the consistent information-conditioned action differences of the probing agent.
  • Figure 2: Robustness under latent prior shift. Left: Average return of three policies evaluated under a biased prior and a reversed prior. The Shortcut policy attains high return under the biased prior despite zero behavioural distance, but degrades sharply under prior reversal. The Probing policy remains stable across priors (mean $\pm$ SE). Right: Convex mixtures $\pi_\alpha = \alpha \pi_{\text{probe}} + (1-\alpha)\pi_{\text{shortcut}}$. Behavioural distance decreases linearly as $\alpha$ decreases, and robustness under prior shift decreases correspondingly, indicating that sensitivity to latent distribution shift is governed by probe-conditioned behavioural separation and not by biased-prior reward alone.
  • Figure 3: Optimisation-induced structural erosion under a heavily biased prior (98%). Left: Behavioural erosion. Return under the dominant prior remains stable while reversed-prior performance degrades. Behavioural distance decreases during biased optimisation and stabilises at a lower value. Middle: Representational erosion. Absolute and normalised hidden-state separation contract over training. Right: Mechanistic verification. Projection of task gradients onto $\mathbf{v}_d = -\nabla d / \|\nabla d\|$ yields a persistent positive net structural force under the biased prior, consistent with Theorem 1. Shaded regions denote $\pm$ one standard deviation across 10 seeds.
  • Figure 4: Abstract Epistemic MDP. A three-phase partially observable task consisting of a probe phase revealing latent mode $m$, a distractor delay phase independent of $m$, and a final evaluation phase at a fixed observation $o^\ast$ that requires retention of probe information. The task enforces an epistemic bottleneck between information acquisition and reward.
  • Figure 5: Sensitivity to Prior Skew. Left: Functional degradation on the rare mode, most pronounced at $\delta=0.98$. Centre: Behavioural distance $d(\pi)$ decreases as skew increases. Right: Internal representation geometry decays across biased priors, preceding functional failure.
  • ...and 1 more figure
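Figure 3's mechanistic check projects task gradients onto $\mathbf{v}_d = -\nabla d / \|\nabla d\|$. A minimal sketch of that projection follows, assuming the skewed objective's gradient is the prior-weighted sum $\delta\, g_{\text{dom}} + (1-\delta)\, g_{\text{rare}}$ of per-mode gradients. That decomposition, the toy vectors, and the function names are illustrative assumptions, not taken from the paper.

```python
import math


def mixture_gradient(grad_dominant, grad_rare, delta):
    """Gradient of an assumed skewed objective
    delta * J_dominant + (1 - delta) * J_rare."""
    return [delta * a + (1 - delta) * b for a, b in zip(grad_dominant, grad_rare)]


def structural_force(task_grad, grad_d):
    """Projection of the task gradient onto v_d = -grad_d / ||grad_d||.

    A positive value means a small ascent step along task_grad lowers the
    behavioural distance d to first order.
    """
    norm = math.sqrt(sum(x * x for x in grad_d))
    if norm == 0.0:
        raise ValueError("grad_d is zero, so the direction v_d is undefined")
    v_d = [-x / norm for x in grad_d]
    return sum(g * v for g, v in zip(task_grad, v_d))


# Hypothetical two-parameter example.
grad_d = [1.0, 0.0]            # d increases along the first parameter
grad_dominant = [-1.0, 0.5]    # dominant mode pushes against d
grad_rare = [1.0, 0.5]         # rare mode pushes to preserve d

force_skewed = structural_force(
    mixture_gradient(grad_dominant, grad_rare, 0.98), grad_d
)
force_balanced = structural_force(
    mixture_gradient(grad_dominant, grad_rare, 0.5), grad_d
)

print(force_skewed, force_balanced)
```

In this toy case the force is about 0.96 at $\delta = 0.98$ and vanishes at $\delta = 0.5$, where the two modes' pulls on $d$ cancel.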

Theorems & Definitions (9)

  • Definition 1: $\epsilon$-Behavioural Equivalence
  • Lemma 1: Robustness Requires Epistemic Distance
  • proof
  • Lemma 2: Convex Contraction
  • proof
  • Proposition 1: Non-Closure under Convex Aggregation
  • proof
  • Theorem 1: Conditional Local Contraction Under Gradient Alignment
  • proof
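
One way to see why a contraction result such as Lemma 2 can hold is the short derivation below. It assumes $d(\pi) = \max_{m,m'} \lVert \pi(\cdot \mid o^\star, h_m) - \pi(\cdot \mid o^\star, h_{m'}) \rVert$ for some norm, and that the mixture $\pi_\alpha = \alpha \pi_1 + (1-\alpha)\pi_2$ is taken pointwise at shared hidden states. Neither the exact definition nor the lemma's statement appears in this summary, so this is a sketch under those assumptions. Write $\Delta^{i}_{m,m'} = \pi_i(\cdot \mid o^\star, h_m) - \pi_i(\cdot \mid o^\star, h_{m'})$ for $i \in \{1, 2\}$.

$$
\begin{aligned}
d(\pi_\alpha)
&= \max_{m,m'} \big\lVert \alpha\, \Delta^{1}_{m,m'} + (1-\alpha)\, \Delta^{2}_{m,m'} \big\rVert \\
&\le \max_{m,m'} \Big( \alpha \big\lVert \Delta^{1}_{m,m'} \big\rVert + (1-\alpha) \big\lVert \Delta^{2}_{m,m'} \big\rVert \Big) \\
&\le \alpha\, d(\pi_1) + (1-\alpha)\, d(\pi_2).
\end{aligned}
$$

The first inequality is the triangle inequality and the second bounds each term by its own maximum. If $\pi_2$ is mode-independent, then $\Delta^{2}_{m,m'} = 0$ and the bound is tight: $d(\pi_\alpha) = \alpha\, d(\pi_1)$, which matches the linear dependence on $\alpha$ described for the convex mixtures in Figure 2.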