Adaptive Shielding for Safe Reinforcement Learning under Hidden-Parameter Dynamics Shifts

Minjae Kwon, Tyler Ingebrand, Ufuk Topcu, Lu Feng

TL;DR

This work introduces safety-regularized optimization, which proactively trains the policy away from high-cost regions, and proves that prediction errors in the shielding connect with bounds on the average cost rate.

Abstract

Unseen shifts in environment dynamics, driven by hidden parameters such as friction or gravity, create a challenge for maintaining safety. We address this challenge by proposing Adaptive Shielding, a framework for safe reinforcement learning in constrained hidden-parameter Markov decision processes. A function encoder infers a low-dimensional representation of the underlying dynamics online from transition data, allowing the shield to adapt. To ensure safety during this process, we use a two-layer strategy. First, we introduce safety-regularized optimization that proactively trains the policy away from high-cost regions. Second, the adaptive shielding reactively uses the inferred dynamics to forecast safety risks and applies uncertainty-aware bounds using conformal prediction to filter unsafe actions. We prove that prediction errors in the shielding connect with bounds on the average cost rate. Empirically, across Safe-Gym benchmarks with varying hidden parameters, our approach outperforms baselines on the return-safety trade-off and generalizes reliably to unseen dynamics, while incurring only modest execution-time overhead. Code is available at https://github.com/safe-autonomy-lab/AdaptiveShieldingFE.
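The reactive layer described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a split-conformal calibration step over held-out one-step prediction errors and a single circular hazard region. The names `conformal_quantile`, `shield_actions`, `toy_predictor`, and the single-integrator dynamics are hypothetical and chosen only for illustration. In the paper the dynamics predictor would be conditioned on the function encoder's representation; here a toy predictor stands in for it.

```python
import math
import numpy as np


def conformal_quantile(scores, delta):
    """Split-conformal bound on the prediction error.

    Given n calibration scores (here: norms of one-step prediction errors),
    return the ceil((n + 1) * (1 - delta))-th smallest score. Under
    exchangeability, a fresh error is below this value with probability
    at least 1 - delta. Returns infinity if n is too small for delta.
    """
    sorted_scores = np.sort(np.asarray(scores, dtype=float))
    n = sorted_scores.size
    k = math.ceil((n + 1) * (1 - delta))
    if k > n:
        return float("inf")
    return float(sorted_scores[k - 1])


def shield_actions(state, actions, predict_next_state,
                   hazard_center, hazard_radius, q):
    """Return indices of actions whose predicted next state stays outside
    the hazard even after shrinking the predicted distance by the margin q."""
    safe = []
    for i, action in enumerate(actions):
        pred = predict_next_state(state, action)
        dist = float(np.linalg.norm(pred - hazard_center))
        # Triangle inequality: true distance >= predicted distance - error.
        if dist - q > hazard_radius:
            safe.append(i)
    return safe


def toy_predictor(state, action):
    """Hypothetical stand-in for a learned dynamics model (assumption)."""
    return state + action


# Illustrative calibration errors and candidate actions (made-up numbers).
calibration_errors = [0.05, 0.10, 0.02, 0.08, 0.04,
                      0.07, 0.03, 0.09, 0.06, 0.01]
q = conformal_quantile(calibration_errors, delta=0.2)

state = np.array([0.0, 0.0])
actions = [np.array([0.8, 0.0]), np.array([0.65, 0.0]),
           np.array([0.0, 0.5]), np.array([-0.5, 0.0])]
hazard_center = np.array([1.0, 0.0])
safe_indices = shield_actions(state, actions, toy_predictor,
                              hazard_center, hazard_radius=0.3, q=q)
```

In this toy setting the second action lands just outside the hazard according to the point prediction, yet the shield rejects it once the conformal margin is subtracted. That is the sense in which the bound is uncertainty-aware: a less accurate predictor yields a larger margin and a more conservative filter.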

Paper Structure

This paper contains 29 sections, 4 theorems, 64 equations, 13 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Let $\Pi_{\text{zero-violation}}$ denote the set of zero-violation policies, defined as $\{\pi \mid J_C(\pi) = 0\}$. Then, for any $\alpha \geq 0$, the optimal policy obtained by maximizing the augmented objective function $J_{\text{aug}}(\pi)$ within $\Pi_{\text{zero-violation}}$ is equivalent to t

Figures (13)

  • Figure 1: Training Dynamics. Results display the mean reward and cost rate (%) over the last 20 epochs across seeds. The top-left position is desirable, indicating higher returns with lower cost rates. Solid points represent mean return and cost rate, while transparent points depict individual seed results.
  • Figure 2: OOD Evaluation. Trade-off between average episodic return and cost rate in out-of-distribution domains. The top-left position is desirable, indicating higher returns with lower cost rates. Solid points represent mean return and cost rate, while transparent points depict individual seed results.
  • Figure 3: Ablation study on Representation. "Oracle-" refers to a policy directly informed of hidden parameters, while "FE-" denotes the function encoder's representation derived from observations.
  • Figure 4: (a) Illustration of how function encoders obtain proxy representations of the underlying hidden parameters using online samples. (b) A naive approach using transformer encoder to infer hidden parameters from a sequence of online samples. MLP stands for multi-layer perceptron in the Figure.
  • Figure 5: Ablation study evaluating the performance of various dynamics predictors in forecasting the next state. The $y$-axis denotes the average per-sample $\ell_1$-norm error between the true and predicted next states on the test dataset.
  • ...and 8 more figures

Theorems & Definitions (11)

  • Proposition 1: Reward Consistency within Zero-Violation Policies
  • proof : Proof Sketch
  • Theorem 4.1
  • proof : Proof Sketch
  • proof
  • Remark 1: Validity under Sequential Dependence
  • Lemma 1
  • proof
  • proof
  • Lemma 2
  • ...and 1 more