POPGym: Benchmarking Partially Observable Reinforcement Learning

Steven Morad; Ryan Kortvelesy; Matteo Bettini; Stephan Liwicki; Amanda Prorok

TL;DR

POPGym addresses a gap in reinforcement learning benchmarks by providing 15 diverse, partially observable environments and 13 memory baselines implemented on RLlib, enabling large-scale comparisons of memory models. The work highlights a surprising disconnect between memory performance in supervised learning and in RL: classic RNNs (notably GRUs) often outperform modern memory architectures, while Elman networks are the most efficient. It also shows that many navigation-centric POMDPs may not stress memory sufficiently, and it advocates for a broader suite of POMDP tasks to evaluate memory capabilities. The framework and results underscore the need for diverse, memory-aware benchmarks to drive progress on partial observability in real-world RL applications.

Abstract

Real-world applications of Reinforcement Learning (RL) are often partially observable, thus requiring memory. Despite this, partial observability is still largely ignored by contemporary RL benchmarks and libraries. We introduce Partially Observable Process Gym (POPGym), a two-part library containing (1) a diverse collection of 15 partially observable environments, each with multiple difficulties, and (2) implementations of 13 memory model baselines, the most in a single RL library. Existing partially observable benchmarks tend to fixate on 3D visual navigation, which is computationally expensive and represents only one type of POMDP. In contrast, POPGym environments are diverse, produce smaller observations, use less memory, and often converge within two hours of training on a consumer-grade GPU. We implement our high-level memory API and memory baselines on top of the popular RLlib framework, providing plug-and-play compatibility with various training algorithms, exploration strategies, and distributed training paradigms. Using POPGym, we execute the largest comparison across RL memory models to date. POPGym is available at https://github.com/proroklab/popgym.

Paper Structure (24 sections, 14 figures, 2 tables)

Introduction
Contributions
Related Work
Fully and Near-Fully Observable Benchmarks
Partially Observable Benchmarks
Shortcomings of Current Benchmarks
Existing Memory Baselines
A Brief Summary on Memory
POPGym Environments
Environment Descriptions
POPGym Baselines
Experiments
Discussion
Supervised learning is a bad proxy for RL.
Use GRUs for performance and Elman nets for efficiency.
...and 9 more sections

Figures (14)

Figure 1: Renders from select POPGym environments.
Figure 2: Performance characteristics for POPGym memory baselines on random inputs. We use a recurrent state size of 256, a batch size of 64, and an episode length of 1024. We compute CPU statistics on a 3GHz Xeon Gold and GPU statistics on a 2080Ti, reporting the mean and 95% confidence interval over 10 trials. Train times correspond to a full batch, while inference times are per-element (i.e., the latency to compute a single action). Note that GPU Train Time has a logarithmic scale.
Figure 3: (Left) A summary comparison of baselines aggregated over all environments. We normalize the MMER such that 0 denotes the worst trial and 1 denotes the best trial for a specific environment. We report the interquartile range (box), median (horizontal line), and mean (dot) of the normalized MMER over all trials. (Right) Single-value scores for each model, produced by averaging the MMER over all POPGym environments. We also provide scores with navigation (Labyrinth) environments excluded; the reasoning is provided in the discussion section.
Figure 4: Selected results used in the discussion section. We standardize the MMER from $[-1, 1]$ to $[0,1]$ for readability. The colored bars denote the mean and the black lines denote the 95% bootstrapped confidence interval. Full results across all environments are in the experimental results section of the paper.
Figure 5: POPGym baselines.
...and 9 more figures
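
The captions for Figures 3 and 4 describe two rescalings of the MMER: a per-environment normalization in which the worst trial maps to 0 and the best to 1, and a shift from $[-1, 1]$ to $[0, 1]$. The sketch below is one plausible reading of those captions, not the authors' code. The function names, the tuple layout, and the sample values are hypothetical and chosen only for illustration; the tie-handling rule is also an assumption, since the captions do not say what happens when all trials in an environment score the same.

```python
from collections import defaultdict


def normalize_per_environment(trials):
    """Min-max normalize MMER within each environment (Figure 3 reading).

    `trials` is a list of (environment, model, mmer) tuples (hypothetical layout).
    Returns (environment, model, normalized_mmer) tuples in the same order.
    """
    by_env = defaultdict(list)
    for env, _, mmer in trials:
        by_env[env].append(mmer)

    normalized = []
    for env, model, mmer in trials:
        lo, hi = min(by_env[env]), max(by_env[env])
        # Assumption: if every trial in an environment ties, report 0.0.
        score = 0.0 if hi == lo else (mmer - lo) / (hi - lo)
        normalized.append((env, model, score))
    return normalized


def rescale_unit_interval(mmer):
    """Map an MMER in [-1, 1] to [0, 1] (Figure 4 reading)."""
    return (mmer + 1.0) / 2.0


# Made-up values, not results from the paper.
trials = [
    ("EnvA", "GRU", 0.5),
    ("EnvA", "Elman", -0.5),
    ("EnvA", "MLP", 0.0),
    ("EnvB", "GRU", 1.0),
    ("EnvB", "Elman", 0.0),
]
result = normalize_per_environment(trials)
```

Normalizing within each environment keeps environments with a narrow reward spread from being drowned out when trials are pooled, which is presumably why the left panel of Figure 3 uses it before aggregating.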
