
Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

Jiaxin Liu

Abstract

Selecting relevant state dimensions in the presence of confounded distractors is a causal identification problem: observational statistics alone cannot reliably distinguish dimensions that correlate with actions from those that actions cause. We formalize this as discovering the agent's Causal Sphere of Influence and propose Interventional Boundary Discovery (IBD), which applies Pearl's do-operator to the agent's own actions and uses two-sample testing to produce an interpretable binary mask over observation dimensions. IBD requires no learned models and composes with any downstream RL algorithm as a preprocessing step. Across 12 continuous control settings with up to 100 distractor dimensions, we find that: (1) observational feature selection can actively select confounded distractors while discarding true causal dimensions; (2) full-state RL degrades sharply once distractors outnumber relevant features by roughly 3:1 in our benchmarks; and (3) IBD closely tracks oracle performance across all distractor levels tested, with gains transferring across SAC and TD3.

Paper Structure (41 sections, 4 theorems, 3 equations, 4 figures, 7 tables, 1 algorithm)

Key Result

Proposition 3.2

There exists an MDP in which a distractor dimension $d_k$ achieves the highest mutual information with actions among all observation dimensions, yet $\mathrm{do}(\mathbf{a})$ has zero causal effect on $d_k$.
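
The intuition behind Proposition 3.2 can be illustrated with a toy one-step structural model. This is only a sketch under stated assumptions, not the MDP used in the paper's proof: the hidden confounder `c`, the noise scales, and the Gaussian closed form for mutual information are all choices made here for illustration. Observationally, the behaviour policy reacts to `c`, so the distractor `d_k` (driven only by `c`) shares far more information with the action than the noisy causal dimension `s`. Under $\mathrm{do}(\mathbf{a})$, the action is drawn independently of `c` and the dependence on `d_k` vanishes.

```python
import numpy as np

def gaussian_mi(x, y):
    """Mutual information (nats) of two jointly Gaussian scalars,
    estimated from the sample correlation (illustrative assumption)."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

def simulate(n, intervene, rng):
    """Hypothetical one-step model with a hidden confounder c."""
    c = rng.normal(size=n)                  # hidden confounder
    if intervene:
        a = rng.normal(size=n)              # do(a): action ignores c
    else:
        a = c + 0.1 * rng.normal(size=n)    # behaviour policy reacts to c
    d_k = c + 0.1 * rng.normal(size=n)      # distractor: driven by c, never by a
    s = 0.5 * a + rng.normal(size=n)        # causal dimension: driven by a, noisy
    return a, d_k, s

rng = np.random.default_rng(0)

a, d_k, s = simulate(20_000, intervene=False, rng=rng)
mi_obs = {"d_k": gaussian_mi(a, d_k), "s": gaussian_mi(a, s)}

a, d_k, s = simulate(20_000, intervene=True, rng=rng)
mi_do = {"d_k": gaussian_mi(a, d_k), "s": gaussian_mi(a, s)}

print("observational:", mi_obs)   # d_k ranks above s
print("interventional:", mi_do)   # d_k drops to roughly zero, s is unchanged
```

In this toy model an MI-based ranking computed from observational data would place the distractor first, while the same statistic computed under intervention ranks only the causal dimension as dependent on the action.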

Figures (4)

  • Figure 1: IBD pipeline and core result. Left: Phase 1: a structured random probe policy collects baseline trajectories and interventional trajectories where actions are replaced by random noise. Phase 2: per-dimension Welch $t$-tests with BH correction produce a binary causal mask applied to a downstream RL algorithm. Right: Feature ranking on reacher_hard (6 true dims, 50 distractors). Under MI-based ranking, causal and distractor dimensions are interleaved; under IBD ranking, all causal dimensions fall above the $\alpha{=}0.05$ threshold and all distractors fall below.
  • Figure 2: Distractor scaling curve. Episode return as a function of distractor dimensionality (6, 50, 100) for walker_walk (left), cheetah_run (center), and reacher_hard (right). Full State (red) degrades as distractors increase; IBD (blue) tracks oracle performance across all distractor counts. Shaded regions: $\pm$1 std over 5 seeds.
  • Figure 3: Diagnostic decomposition. Episode return across 8 representative settings. Three regimes emerge. (1) walker_walk (easy): all methods perform similarly, so no selection is needed. (2) Most medium/hard settings: Oracle $\gg$ Full State but IBD $\approx$ Oracle, indicating that distractors are the bottleneck and IBD resolves it. (3) hopper_hop: all methods near zero, indicating that the bottleneck is exploration, not feature selection.
  • Figure 4: Partial controllability detection. Recall on partially controllable dimensions as a function of mixing coefficient $\alpha$, for cheetah_run and walker_walk. At $\alpha = 0$ (purely exogenous), recall is correctly zero; by $\alpha \approx 0.05$ (${\sim}5\%$ causal variance), recall reaches 1.0. The dashed line shows overall F1 on cheetah_run, confirming that boundary quality remains high. Precision $\geq 0.92$ at all $\alpha$ values. Shaded regions: $\pm$1 std over 3 seeds.
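
Phase 2 of the pipeline in Figure 1 (per-dimension Welch $t$-tests followed by BH correction) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `benjamini_hochberg` and `causal_mask` are hypothetical, the rows of each array are assumed to be some per-sample summary statistic for each observation dimension (the source does not specify which statistic is tested), and the demo data are synthetic.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()   # largest rank meeting its threshold
        reject[order[: k + 1]] = True     # reject everything up to that rank
    return reject

def causal_mask(baseline, interventional, alpha=0.05):
    """Welch t-test per column (dimension), then BH correction.

    Both inputs have shape (n_samples, n_dims); True marks a dimension
    whose distribution shifts under intervention."""
    _, p = stats.ttest_ind(baseline, interventional, axis=0, equal_var=False)
    return benjamini_hochberg(p, alpha)

# Synthetic demo: dims 0-2 respond to the intervention, dims 3-7 do not.
rng = np.random.default_rng(0)
baseline = rng.normal(size=(500, 8))
interventional = rng.normal(size=(500, 8))
interventional[:, :3] += 1.0

mask = causal_mask(baseline, interventional)
filtered_obs = baseline[:, mask]          # masked observations for a downstream learner
print(mask)
```

Because the mask is a plain boolean vector, applying it is a single indexing step, which is consistent with the claim that the method composes with any downstream RL algorithm as preprocessing.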

Theorems & Definitions (5)

  • Definition 3.1: Causal Sphere of Influence
  • Proposition 3.2: MI can rank distractors above causal dimensions
  • Proposition 3.3: Confounding immunity
  • Proposition 3.4: Interventional detectability
  • Proposition 3.5: Finite-sample error control