The Generalization Gap in Offline Reinforcement Learning

Ishita Mediratta, Qingfei You, Minqi Jiang, Roberta Raileanu

TL;DR

This work investigates how well online and offline reinforcement learning methods generalize to unseen environments. Using new benchmarks built on Procgen and WebShop, it shows that offline approaches, including BCQ, CQL, IQL, and transformer-based variants, underperform online PPO in zero-shot transfer, with behavioral cloning (BC) often the strongest offline baseline. A central finding is that increasing data diversity across environments improves generalization far more than simply increasing dataset size. The study highlights the need for new offline methods and data collection strategies that explicitly optimize cross-environment robustness, and it provides open-source datasets and baselines to spur future research.

Abstract

Despite recent progress in offline learning, these methods are still trained and tested on the same environment. In this paper, we compare the generalization abilities of widely used online and offline learning methods such as online reinforcement learning (RL), offline RL, sequence modeling, and behavioral cloning. Our experiments show that offline learning algorithms perform worse on new environments than online learning ones. We also introduce the first benchmark for evaluating generalization in offline learning, collecting datasets of varying sizes and skill levels from Procgen (2D video games) and WebShop (e-commerce websites). The datasets contain trajectories for a limited number of game levels or natural language instructions, and at test time the agent has to generalize to new levels or instructions. Our experiments reveal that existing offline learning algorithms struggle to match the performance of online RL on both train and test environments. Behavioral cloning is a strong baseline, outperforming state-of-the-art offline RL and sequence modeling approaches when trained on data from multiple environments and tested on new ones. Finally, we find that increasing the diversity of the data, rather than its size, improves performance on new environments for all offline learning algorithms. Our study demonstrates the limited generalization of current offline learning algorithms, highlighting the need for more research in this area.

Paper Structure (49 sections, 29 figures, 8 tables)

Figures (29)

  • Figure 1: (a) Sample screenshots from the train and test environments of four Procgen games. (b) Sample instructions (item descriptions) from the train and test set of human demonstrations from WebShop. Red and blue highlight the type and attributes of the desired item, respectively.
  • Figure 2: Performance on Procgen 1M Expert Dataset. Train and test min-max normalized returns aggregated across all 16 Procgen games, when trained on expert demonstrations. Each method was evaluated online across 100 episodes on levels sampled uniformly from the test set. The IQM aggregate metric is computed over 5 model seeds, with the error bars representing upper (75th) and lower (25th) interval estimates. BC outperforms all offline RL and sequence modelling approaches on both train and test environments. All offline learning methods lag behind online RL on both train and test.
  • Figure 3: Performance on Procgen 1M Mixed Expert-Suboptimal Dataset. Train and test min-max normalized returns aggregated across all 16 Procgen games, when trained on mixed expert-suboptimal demonstrations. Each method was evaluated online across 100 episodes on levels sampled uniformly from the test set. The IQM aggregate metric is computed over 3 model seeds, with the error bars representing upper (75th) and lower (25th) interval estimates. Similar to Figure 2, BC outperforms all offline RL and sequence modelling approaches on both train and test environments. All offline learning methods lag behind online RL on both train and test.
  • Figure 4: Performance of each baseline across selected Procgen games when trained and tested on the same level using the expert and suboptimal datasets. Here we report performance on selected games: Chaser, Coinrun, Jumper, and Leaper. For all games, refer to Figures \ref{fig:procgen_expert_level40_all_games} and \ref{fig:procgen_subop_level40_all_games} in Appendix \ref{sec:single_level_procgen}.
  • Figure 5: The Effect of Data Diversity on Performance. Train and test performance of offline learning algorithms for a varying number of training levels in the 1M expert datasets, aggregated across all Procgen games. The plot shows the IQM, and error bars represent the 75th and 25th percentiles computed over 3 model seeds. While the training performance doesn't change much with the number of training levels, the test performance increases (and the generalization gap decreases) with the diversity of the dataset.
  • ...and 24 more figures
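
The captions above report min-max normalized returns aggregated with the interquartile mean (IQM), and Figure 5 refers to a generalization gap. The sketch below illustrates one plausible way to compute these quantities; it is not the authors' evaluation code. The function names, the per-game bounds `r_min` and `r_max`, and all return values are hypothetical, and the gap is assumed here to be train IQM minus test IQM. The IQM drops the lowest and highest `floor(n/4)` scores and averages the rest.

```python
import numpy as np


def min_max_normalize(ret, r_min, r_max):
    """Rescale raw returns so r_min maps to 0 and r_max maps to 1."""
    return (np.asarray(ret, dtype=float) - r_min) / (r_max - r_min)


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = x.size // 4  # number of scores trimmed from each end
    return float(x[cut:x.size - cut].mean())


def generalization_gap(train_scores, test_scores):
    """Assumed definition: train IQM minus test IQM (larger means worse transfer)."""
    return iqm(train_scores) - iqm(test_scores)


# Hypothetical raw returns for one game (illustrative values only).
r_min, r_max = 0.0, 10.0  # assumed per-game normalization bounds
train = min_max_normalize([9.0, 8.0, 7.0, 9.5, 8.5, 6.0, 10.0, 9.0], r_min, r_max)
test = min_max_normalize([5.0, 4.0, 6.0, 3.0, 7.0, 5.5, 4.5, 2.0], r_min, r_max)

print("train IQM:", iqm(train))
print("test IQM:", iqm(test))
print("generalization gap:", generalization_gap(train, test))

# 25th and 75th percentiles, one possible reading of the error bars in the captions.
print("test 25th/75th percentiles:", np.percentile(test, [25, 75]))
```

The IQM is less sensitive to outlier runs than the mean while using more of the data than the median, which is one reason it is a common aggregate when only a few seeds are available.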