Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers

Juncheng Dong, Bowen He, Moyang Guo, Ethan X. Fang, Zhuoran Yang, Vahid Tarokh

TL;DR

This work addresses the challenge of reward dependence in in-context reinforcement learning by introducing ICPRL, a reward-free paradigm that relies solely on preference feedback for both pretraining and deployment. It studies two settings, I-PRL (per-step preferences) and T-PRL (trajectory preferences), and develops both a supervised baseline (DP^2T) and two reward-free, preference-native pretraining frameworks: ICPO (in-context preference optimization) and ICRG (in-context reward generation). ICPO directly optimizes a transformer policy using a learned in-context utility and a closed-form policy update, while ICRG constructs a reward representation from trajectory preferences and then leverages standard ICRL methods. Experiments on DarkRoom and Meta-World demonstrate strong in-context generalization to unseen tasks, with ICPO and ICRG performing competitively with reward-supervised baselines and even outperforming them in some continuous-control settings. The proposed approaches offer a data-efficient, reward-free route to training transformer-based meta-policies that adapt to new tasks solely from preference information, with potential for low-cost annotation via human or LLM-based preference labeling.

Abstract

In-context reinforcement learning (ICRL) leverages the in-context learning capabilities of transformer models (TMs) to efficiently generalize to unseen sequential decision-making tasks without parameter updates. However, existing ICRL methods rely on explicit reward signals during pretraining, which limits their applicability when rewards are ambiguous, hard to specify, or costly to obtain. To overcome this limitation, we propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback, eliminating the need for reward supervision. We study two variants that differ in the granularity of feedback: Immediate Preference-based RL (I-PRL) with per-step preferences, and Trajectory Preference-based RL (T-PRL) with trajectory-level comparisons. We first show that supervised pretraining, a standard approach in ICRL, remains effective under preference-only context datasets, demonstrating the feasibility of in-context reinforcement learning using only preference signals. To further improve data efficiency, we introduce alternative preference-native frameworks for I-PRL and T-PRL that directly optimize TM policies from preference data without requiring reward signals or optimal action labels. Experiments on dueling bandits, navigation, and continuous control tasks demonstrate that ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
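
To make the notion of trajectory-level preference feedback concrete, the following is a minimal sketch of how a comparison between two trajectories could be scored. It assumes a Bradley-Terry preference model over summed per-step scores, which is a common choice in preference-based RL but is not stated in this summary, so it should not be read as the paper's method. The function names `bt_preference_prob` and `bt_nll` are hypothetical.

```python
import math


def bt_preference_prob(scores_a, scores_b):
    """Probability that trajectory A is preferred over trajectory B.

    ASSUMPTION: a Bradley-Terry model applied to the difference of summed
    per-step scores. The source text does not specify the preference model.
    """
    diff = sum(scores_a) - sum(scores_b)
    # Numerically stable logistic sigmoid of the score difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def bt_nll(pairs):
    """Average negative log-likelihood of a batch of preference labels.

    Each element of `pairs` is (scores_preferred, scores_rejected), i.e. the
    per-step scores of the trajectory that was chosen and of the one that
    was not.
    """
    total = 0.0
    for preferred, rejected in pairs:
        total -= math.log(bt_preference_prob(preferred, rejected))
    return total / len(pairs)
```

Under this assumed model, a score estimator that assigns higher summed scores to the preferred trajectories yields a lower `bt_nll`, so the loss can be minimized using only comparison labels and no reward values.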

Paper Structure (48 sections, 23 equations, 11 figures, 2 algorithms)

This paper contains 48 sections, 23 equations, 11 figures, 2 algorithms.

Figures (11)

  • Figure 1: (a) ICRL methods adapt to new RL tasks using in-context learning, but both their pretraining and deployment require access to reward signals, which can be costly or impractical to obtain in many settings. (b) This work proposes a novel ICRL paradigm that uses only preference data for both pretraining and deployment, eliminating the need for explicit reward supervision.
  • Figure 2: I-PRL results under context datasets of varying quality (left to right: low, medium, and high quality) in DarkRoom (top) and Meta-World (bottom).
  • Figure 3: Meta-World (T-PRL) results with context datasets of low, medium, and high quality.
  • Figure 4: Transformer policy architecture. The bottom model illustrates the architecture of TM policies for the I-PRL setting. The top depicts the T-PRL setting where the reward values $\widehat{r}_h$ are approximated by the in-context reward estimator.
  • Figure 5: The architecture of the reward transformer.
  • ...and 6 more figures