Safe In-Context Reinforcement Learning

Amir Moeini, Minjae Kwon, Alper Kamil Bozkurt, Yuichi Motai, Rohan Chandra, Lu Feng, Shangtong Zhang

TL;DR

This work addresses the safety gap in in-context reinforcement learning (ICRL) by casting adaptation as a constrained MDP (CMDP), so that the agent respects safety constraints while adapting to new tasks without any parameter updates. It proposes two pretraining paradigms: Safe Supervised Pretraining, which distills safe RL behavior conditioned on return-to-go and cost-to-go (CTG), and Safe Reinforcement Pretraining, which uses Exact Penalty Policy Optimization (EPPO) to enforce per-episode cost limits via a dual surrogate and iterative updates whose fixed points are tied to primal optimality. The authors evaluate OOD generalization and reward–cost trade-offs on SafeDarkRoom and SafeDarkMujoco, showing that reinforcement pretraining generalizes to unseen, out-of-distribution tasks while respecting safety constraints, whereas supervised pretraining struggles in the more complex domains. Theoretical results establish that EPPO’s fixed points are primal-optimal under mild conditions, and ablations show robustness to context length, model size, and dataset size. Overall, the paper advances safe, adaptable RL that operates without parameter updates at test time, with practical implications for real-world autonomous systems.

Abstract

In-context reinforcement learning (ICRL) is an emerging RL paradigm where the agent, after some pretraining procedure, is able to adapt to out-of-distribution test tasks without any parameter updates. The agent achieves this by continually expanding the input (i.e., the context) to its policy neural networks. For example, the input could be all the history experience that the agent has access to until the current time step. The agent's performance improves as the input grows, without any parameter updates. In this work, we propose the first method that promotes the safety of ICRL's adaptation process in the framework of constrained Markov Decision Processes. In other words, during the parameter-update-free adaptation process, the agent not only maximizes the reward but also minimizes an additional cost function. We also demonstrate that our agent actively reacts to the threshold (i.e., budget) of the cost tolerance. With a higher cost budget, the agent behaves more aggressively, and with a lower cost budget, the agent behaves more conservatively.

Paper Structure

This paper contains 28 sections, 3 theorems, 26 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

[Proof in Appendix apx:t1] We say a pair $(\bar{\pi},\bar{\lambda})$ is a fixed point of eq:iter if $\bar{\pi}\in\arg\max_\pi L_\Sigma(\pi,\bar{\lambda})$, and for all sufficiently small $\eta>0$, $\bar{\lambda}=[\bar{\lambda}+\eta\,\max_k g_k(\bar{\pi})]_+$. Let Assumptions th:a1 - th:a2 hold. Then
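Separately from the theorem's conclusion, the dual half of the fixed-point definition above can be illustrated numerically. The sketch below is not the authors' code: it assumes each $g_k(\bar{\pi})$ is a scalar constraint value with the convention that $g_k \le 0$ means the constraint is satisfied, and the function names `dual_step` and `is_dual_fixed_point` are hypothetical. It also approximates "for all sufficiently small $\eta$" by testing a few small step sizes.

```python
def dual_step(lam, g_values, eta):
    """One projected step on the multiplier: [lam + eta * max_k g_k]_+."""
    return max(0.0, lam + eta * max(g_values))


def is_dual_fixed_point(lam, g_values, etas=(1e-1, 1e-2, 1e-3), tol=1e-12):
    """Check lam == [lam + eta * max_k g_k]_+ for a few small step sizes.

    Assumption: testing a handful of small eta values stands in for
    'all sufficiently small eta' in the definition.
    """
    return all(abs(dual_step(lam, g_values, eta) - lam) <= tol for eta in etas)


# Strictly satisfied constraints: only lam = 0 is left unchanged.
print(is_dual_fixed_point(0.0, [-0.5, -0.1]))  # True
print(is_dual_fixed_point(2.0, [-0.5, -0.1]))  # False
# Tight worst-case constraint (max_k g_k = 0): any lam >= 0 is left unchanged.
print(is_dual_fixed_point(2.0, [-0.5, 0.0]))   # True
# Violated constraint: the multiplier keeps growing, so no fixed point.
print(is_dual_fixed_point(0.0, [0.3]))         # False
```

Under these assumptions, the multiplier condition holds exactly when the worst-case constraint value is zero, or when it is negative and the multiplier is zero, which is the familiar feasibility plus complementary-slackness pattern.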

Figures (6)

  • Figure 1: Evaluation performance of supervised and reinforcement pretraining. The curves are averaged over $16$ distinct OOD evaluation tasks, with each task featuring edge-oriented goals and obstacles. The shaded regions indicate standard errors. The $x$-axis is the episode index $k$ during the evaluation task. The $y$-axis is the corresponding episode return $G(\tau_k)$ and the episode cost $G_c(\tau_k)$, respectively. The straightforward safe supervised pretraining baseline succeeds only in SafeDarkRoom, while our novel safe reinforcement pretraining method succeeds in all three domains.
  • Figure 2: Evaluation performance of supervised and reinforcement pretraining with varying CTG. The curves are displayed across a range of cost limits. For each cost limit, the result is averaged over $16$ distinct OOD evaluation tasks, with each task incorporating edge-oriented goals and obstacles. The shaded regions indicate standard errors. The $x$-axis is the CTG. The $y$-axis is the total episode return (i.e., $\sum_{k=1}^K G(\tau_k)$) and the maximum episode cost (i.e., $\max_{k \in K} G_c(\tau_k)$) with the corresponding CTG. The policy from safe reinforcement pretraining succeeds in converting a higher cost limit (i.e., a higher CTG) to a higher return while the policy from safe supervised pretraining fails to do so.
  • Figure 3: During training, goals and obstacles are generated with a center-oriented approach, while during evaluation, they are edge-oriented. This applies consistently to both goals and obstacles. We set $\alpha = 0.5$ for generating goals and obstacles. The red color denotes the robot, the green color represents the goal location, and obstacles are depicted in shades of blue.
  • Figure 4: Performance Comparison of Algorithm Distillation (AD) style supervised pretraining and Algorithm Distillation with Noise (AD-EPS) style supervised pretraining in SafeDarkRoom. AD-EPS fails to generalize in OOD environments.
  • Figure 5: Ablations of Safe Reinforcement Pretraining on SafeDarkRoom. (a), (b) The evaluation is set up similarly to Question (i). For context length, we compare 150 and 3000 against the base value of 1500. For model size, we compare embedding dimensions 32 and 128 against the base of 64. (c) At each training step, we plot the average total return of 50 episodes across 10 random test environments and the average Q-value across 50 episodes of 250 random training environments. EPPO is easier to tune and more stable to train than the naive primal-optimal method.
  • ...and 1 more figure
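The quantities plotted in Figure 2 can be made concrete with a short sketch. It assumes per-episode returns $G(\tau_k)$ and costs $G_c(\tau_k)$ are available as plain lists for each evaluation task, and it uses the usual sample-standard-deviation-over-$\sqrt{n}$ definition of the standard error, which the captions do not spell out. The function names `task_summary` and `mean_and_stderr` are hypothetical.

```python
import math
from statistics import mean, stdev


def task_summary(episode_returns, episode_costs):
    """Return (sum_k G(tau_k), max_k G_c(tau_k)) for one evaluation task."""
    return sum(episode_returns), max(episode_costs)


def mean_and_stderr(values):
    """Sample mean and standard error (sample std / sqrt(n)) across tasks."""
    n = len(values)
    stderr = stdev(values) / math.sqrt(n) if n > 1 else 0.0
    return mean(values), stderr


# Two made-up tasks with three episodes each (illustrative numbers only).
tasks = [
    ([1.0, 2.0, 3.0], [0.0, 2.0, 1.0]),
    ([2.0, 2.0, 4.0], [1.0, 0.0, 3.0]),
]
summaries = [task_summary(returns, costs) for returns, costs in tasks]
total_returns = [total for total, _ in summaries]
max_costs = [worst for _, worst in summaries]
print(mean_and_stderr(total_returns))
print(mean_and_stderr(max_costs))
```

Reporting the maximum episode cost rather than the mean matches the per-episode reading of the cost limit: one episode over budget is visible even if the others are cheap.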

Theorems & Definitions (5)

  • Theorem 1
  • Proposition 1: Proof in Appendix apx:prop
  • Lemma 1
  • proof
  • proof