Filtering Learning Histories Enhances In-Context Reinforcement Learning

Weiqin Chen, Xinjie Zhang, Dharmashankar Subramanian, Santiago Paternain

TL;DR

This work tackles the suboptimality inherited by in-context reinforcement learning (ICRL) when transformers imitate complete learning histories from a source RL algorithm. It introduces Learning History Filtering (LHF), a dataset preprocessing technique that reweights and filters histories using a unified score U(D) = Improvement + λ · Stability, inspired by weighted empirical risk minimization. LHF is a plug-in method compatible with state-of-the-art ICRL backbones (AD, DPT, DICP) and demonstrates robust performance gains on Darkroom-type benchmarks and Meta-World-ML1, especially under noisy data, partial histories, and lightweight models. The results underscore the value of data-centric interventions to improve ICRL and offer a pathway toward more reliable zero-shot generalization in diverse environments.

Abstract

Transformer models (TMs) have exhibited remarkable in-context reinforcement learning (ICRL) capabilities, allowing them to generalize to and improve in previously unseen environments without re-training or fine-tuning. This is typically accomplished by imitating the complete learning histories of a source RL algorithm across a large number of pretraining environments, which, however, may transfer suboptimal behaviors inherited from the source algorithm or dataset. In this work, we therefore address the issue of inherited suboptimality from the perspective of dataset preprocessing. Motivated by the success of weighted empirical risk minimization, we propose a simple yet effective approach, learning history filtering (LHF), which enhances ICRL by reweighting and filtering the learning histories based on their improvement and stability characteristics. To the best of our knowledge, LHF is the first approach to avoid source suboptimality through dataset preprocessing, and it can be combined with current state-of-the-art (SOTA) ICRL algorithms. We substantiate the effectiveness of LHF through a series of experiments on well-known ICRL benchmarks, encompassing both discrete environments and continuous robotic manipulation tasks, with three SOTA ICRL algorithms (AD, DPT, DICP) as the backbones. LHF exhibits robust performance across a variety of suboptimal scenarios, as well as under varying hyperparameters and sampling strategies. Notably, the advantage of LHF becomes more pronounced in the presence of noisy data, indicating the significance of filtering learning histories.
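
To make the filtering idea concrete, the following is a minimal sketch of score-based retention, not the authors' implementation. The unified score U(D) = Improvement + λ · Stability and the existence of a softmax temperature α come from the paper's summary; everything else is an assumption made for illustration: the `improvement` and `stability` proxies, the function names, the choice to divide by α, and the rescaling that gives the best-scoring history a retention probability of 1.

```python
import math
import random
from typing import List, Sequence


def improvement(returns: Sequence[float]) -> float:
    """Hypothetical proxy: mean return of the last quarter minus the first quarter."""
    k = max(1, len(returns) // 4)
    return sum(returns[-k:]) / k - sum(returns[:k]) / k


def stability(returns: Sequence[float]) -> float:
    """Hypothetical proxy: negative mean absolute change between consecutive episodes."""
    if len(returns) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(returns, returns[1:])]
    return -sum(diffs) / len(diffs)


def unified_score(returns: Sequence[float], lam: float = 1.0) -> float:
    """U(D) = Improvement + lambda * Stability, using the proxies above."""
    return improvement(returns) + lam * stability(returns)


def retention_probs(scores: Sequence[float], alpha: float = 1.0) -> List[float]:
    """Softmax-shaped weights with temperature alpha (assumed to divide the score),
    rescaled so the highest-scoring history is kept with probability 1."""
    top = max(scores)
    return [math.exp((s - top) / alpha) for s in scores]


def filter_histories(
    histories: Sequence[Sequence[float]],
    lam: float = 1.0,
    alpha: float = 1.0,
    seed: int = 0,
) -> List[Sequence[float]]:
    """Keep each learning history independently with its retention probability."""
    rng = random.Random(seed)
    scores = [unified_score(h, lam) for h in histories]
    probs = retention_probs(scores, alpha)
    return [h for h, p in zip(histories, probs) if rng.random() < p]


if __name__ == "__main__":
    # Each history is a list of per-episode returns from the source RL algorithm.
    steady = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0]   # improves smoothly
    flat = [0.1] * 8                                     # never improves
    noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]     # oscillates
    kept = filter_histories([steady, flat, noisy], lam=1.0, alpha=1.0)
    print(f"kept {len(kept)} of 3 histories")
```

Under these assumptions, a smoothly improving history receives the highest score and is always retained, while flat or oscillating histories are kept only occasionally. The filtered list can then be passed unchanged to whichever ICRL backbone is used for pretraining, which is what makes the approach a preprocessing step rather than a change to the training objective.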

Paper Structure

This paper contains 30 sections, 13 equations, 8 figures, 8 tables, 2 algorithms.

Figures (8)

  • Figure 1: Schematic of learning history filtering (LHF). Current ICRL methods employ a source RL algorithm (e.g., PPO) to collect learning histories across a large number of environments, resulting in a pretraining dataset composed of multiple learning histories with varying levels of performance (left). LHF filters this pretraining dataset by randomly retaining each learning history with a probability that depends on its improvement and stability characteristics. As a result, high-quality learning histories (A, B, D) are more likely to be retained, in varying proportions, while suboptimal ones (C, E) tend to be filtered out (middle). After filtering, we follow the standard process for pretraining transformer models (right).
  • Figure 2: Test-time learning curves of our LHF approach (solid lines) compared with the original baselines (dashed lines). Each curve shows the mean and standard deviation over three independent runs. The backbone algorithms include AD (red), DICP (blue), and DPT (green).
  • Figure 3: Test-time learning curves of our LHF approach (solid lines) compared with the original baselines (dashed lines) on the noisy dataset. Each curve shows the mean and standard deviation over three independent runs. The backbone algorithms include AD (red), DICP (blue), and DPT (green).
  • Figure 4: Test-time learning curves of our LHF approach (solid lines) compared with the original baselines (dashed lines) on Meta-World-ML1 environments. Each curve shows the mean and standard deviation over three independent runs. The backbone algorithms include AD (red) and DICP (blue).
  • Figure 5: Test-time learning curves of our LHF approach (solid lines) compared with the original baselines (dashed lines) for different values of the stability coefficient $\lambda$ ((a) and (b)) and of the temperature coefficient $\alpha$ in the Softmax sampling strategy ((c) and (d)). Each curve shows the mean and standard deviation over three independent runs. The backbone algorithms include AD and DICP.
  • ...and 3 more figures