Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning

Jaehyeon Son, Soochan Lee, Gunhee Kim

TL;DR

The paper tackles the limitation of in-context RL methods that imitate suboptimal source algorithms by introducing Distillation for In-Context Planning (DICP), which jointly learns an in-context dynamics model and policy using Transformers. By applying Model Predictive Control with an in-context learned world model, DICP can plan ahead and deviate from the source algorithm’s gradual updates, delivering superior performance with substantially fewer environment interactions across discrete Darkroom variants and continuous Meta-World ML1/ML10 benchmarks. Empirical results show state-of-the-art performance against both model-free and model-based meta-RL baselines, with ablations confirming the benefits of planning scale, longer context, and robustness to different source algorithms. The approach highlights the practical value of integrating planning into in-context learning, offering a scalable path to more sample-efficient meta-RL in complex domains, albeit with higher inference-time computation that can be mitigated by future efficiency improvements.

Abstract

Recent studies have shown that Transformers can perform in-context reinforcement learning (RL) by imitating existing RL algorithms, enabling sample-efficient adaptation to unseen tasks without parameter updates. However, these models also inherit the suboptimal behaviors of the RL algorithms they imitate. This issue primarily arises due to the gradual update rule employed by those algorithms. Model-based planning offers a promising solution to this limitation by allowing the models to simulate potential outcomes before taking action, providing an additional mechanism to deviate from the suboptimal behavior. Rather than learning a separate dynamics model, we propose Distillation for In-Context Planning (DICP), an in-context model-based RL framework where Transformers simultaneously learn environment dynamics and improve policy in-context. We evaluate DICP across a range of discrete and continuous environments, including Darkroom variants and Meta-World. Our results show that DICP achieves state-of-the-art performance while requiring significantly fewer environment interactions than baselines, which include both model-free counterparts and existing meta-RL methods.
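
The planning idea in the abstract, simulating candidate outcomes with a learned dynamics model before acting, can be illustrated with a small beam-search planner. This is a minimal sketch under stated assumptions, not the authors' implementation: `plan_action`, `toy_propose`, and `toy_predict` are hypothetical names, and the two toy functions stand in for the action proposals and the next-state and reward predictions that DICP would obtain from the Transformer in-context. The `beam_size` and `sample_size` parameters loosely mirror the beam size $K$ and sample size $L$ mentioned in the Figure 3 caption.

```python
from typing import Callable, List, Optional, Tuple

GOAL = 7  # hypothetical goal position on a 1-D line with states 0..9


def toy_propose(state: int, n: int) -> List[int]:
    """Stand-in for sampling n candidate actions from a policy."""
    return [-1, 0, 1][:n]


def toy_predict(state: int, action: int) -> Tuple[int, float]:
    """Stand-in for a learned dynamics model: returns (next_state, reward)."""
    next_state = min(9, max(0, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def plan_action(
    state: int,
    propose: Callable[[int, int], List[int]],
    predict: Callable[[int, int], Tuple[int, float]],
    horizon: int = 3,
    beam_size: int = 4,    # number of imagined trajectories kept per step
    sample_size: int = 3,  # number of candidate actions expanded per beam
) -> Optional[int]:
    """Return the first action of the best imagined trajectory."""
    # Each beam entry: (predicted return so far, first action, imagined state).
    beams: List[Tuple[float, Optional[int], int]] = [(0.0, None, state)]
    for _ in range(horizon):
        candidates = []
        for total, first, s in beams:
            for a in propose(s, sample_size):
                s_next, r = predict(s, a)
                # Remember only the first action; later ones are re-planned.
                candidates.append((total + r, a if first is None else first, s_next))
        # Keep the beam_size candidates with the highest predicted return.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][1]


if __name__ == "__main__":
    print(plan_action(5, toy_propose, toy_predict))
```

Only the first action of the best imagined trajectory is executed, and planning is repeated at the next step, which is the usual Model Predictive Control pattern. Larger `beam_size` and `sample_size` values explore more imagined trajectories at the cost of more model calls per step.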

Paper Structure

This paper contains 20 sections, 4 equations, 10 figures, 10 tables, 3 algorithms.

Figures (10)

  • Figure 2: Learning curves of in-context RL approaches during the meta-test phase on discrete (1st row) and continuous (2nd and 3rd rows) environments. Our methods outperform model-free counterparts in both sample efficiency and overall performance. Results are averaged over 5 and 3 train-test splits for discrete and continuous benchmarks, respectively. We also report the mean success rate across all 50 tasks in Meta-World ML1. The final performance results for all ML1 benchmarks are presented in Table \ref{tab:results} and Table \ref{tab:results-all}. Shaded areas represent 95% confidence intervals.
  • Figure 3: First row: The effect of model-based planning at different scales. We present learning curves with varying beam sizes $K$ and sample sizes $L$. For the case labeled "No planning," the dynamics model is not utilized for planning, while the meta-model is still trained. The dashed vertical line marks the time step when planning begins, coinciding with the point where the context is fully filled. Second row: The effect of context lengths on the accuracy of the in-context learned dynamics model. Results are averaged over 3 train-test splits.
  • Figure : (a) Previous approaches
  • Figure : Meta-Training Phase
  • Figure : (a) DICP-AD
  • ...and 5 more figures