Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs

Tianwei Ni, Benjamin Eysenbach, Ruslan Salakhutdinov

TL;DR

The paper argues that recurrent model-free reinforcement learning, when carefully implemented and tuned, can serve as a strong baseline across a wide range of POMDP problems, rivaling specialized methods. It analyzes the design space, reports greater sample efficiency and asymptotic performance than the specialized methods on 18 of 21 environments, and releases an efficient, reusable codebase. Through extensive ablations, the authors identify the factors that drive performance: separate actor/critic encoders, informative inputs, RL backbones such as TD3 and SAC, and context length. The work suggests recurrent model-free RL is a practical baseline for POMDPs and that future work could automate these design choices.

Abstract

Many problems in RL, such as meta-RL, robust RL, generalization in RL, and temporal credit assignment, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory-based architectures, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisions can often yield a recurrent model-free implementation that performs on par with (and occasionally substantially better than) more sophisticated recent techniques. We compare to 21 environments from 6 prior specialized methods and find that our implementation achieves greater sample efficiency and asymptotic performance than these methods on 18/21 environments. We also release a simple and efficient implementation of recurrent model-free RL for future work to use as a baseline for POMDPs.
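Two of the design factors named in the TL;DR, context length and informative inputs, can be illustrated with a short sketch. This is not the authors' released code. It assumes that "context length" means the number of consecutive transitions fed to the recurrent encoder during training, and that "informative inputs" means the observation concatenated with the previous action and previous reward. The names `sample_context` and `build_inputs` and the transition layout are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# One transition: (observation, action, reward).
# Hypothetical layout chosen for illustration only.
Transition = Tuple[Sequence[float], Sequence[float], float]


def sample_context(episode: List[Transition], context_len: int,
                   rng: random.Random) -> List[Transition]:
    """Return a contiguous window of at most `context_len` transitions."""
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    if len(episode) <= context_len:
        return list(episode)
    # Pick a random start so that the whole window fits inside the episode.
    start = rng.randrange(len(episode) - context_len + 1)
    return episode[start:start + context_len]


def build_inputs(window: List[Transition]) -> List[List[float]]:
    """Build per-step encoder inputs: observation + previous action + previous reward.

    Simplifying assumption: the first step of the window uses zeros for the
    previous action and reward, even if the window starts mid-episode.
    """
    action_dim = len(window[0][1])
    prev_action = [0.0] * action_dim
    prev_reward = 0.0
    inputs = []
    for obs, action, reward in window:
        inputs.append(list(obs) + prev_action + [prev_reward])
        prev_action, prev_reward = list(action), float(reward)
    return inputs


if __name__ == "__main__":
    # Toy episode: 1-D observation equal to the time step, constant action and reward.
    episode = [([float(t)], [0.5], 1.0) for t in range(10)]
    window = sample_context(episode, context_len=4, rng=random.Random(0))
    for row in build_inputs(window):
        print(row)
```

Each row produced by `build_inputs` would be one time step of input to a recurrent encoder. A shorter `context_len` is cheaper per update but limits how far back the encoder can look, which is why it is treated as a tunable design choice.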

Paper Structure

This paper contains 61 sections, 1 equation, 22 figures, 7 tables.

Figures (22)

  • Figure 1: The importance of implementation for recurrent model-free RL. This paper identifies important design decisions for recurrent model-free RL. Our implementation outperforms prior implementations (e.g. PPO-GRU and A2C-GRU from [kostrikov2018pytorch]) and purpose-designed methods (e.g. VRM from [han2019variational]) on their respective POMDP benchmarks.
  • Figure 2: Learning curves on four meta-RL environments. Our implementation of recurrent model-free RL can surpass the specialized meta-RL method off-policy variBAD [dorfman2020offline] on its environments, Semi-Circle and Wind, and greatly outperform on-policy variBAD [zintgraf2019varibad] on its environment Cheetah-Dir, but fails to match its performance on Ant-Dir. On Cheetah-Dir and Ant-Dir, we also show the learning curves of the best off-policy oracle and Markovian policies. The learning curves of on-policy variBAD, oracle PPO, and RL2 [duan2016rl] are plotted from data copied from on-policy variBAD's public GitHub repository.
  • Figure 3: Learning curves on one robust RL environment, Cheetah-Robust. We show the average returns (left figure) and worst returns (right figure) of each method. The single best variant of our implementation of recurrent model-free RL can greatly outperform the specialized robust RL method MRPO [jiang2021monotonic], and is more sample-efficient and stable than recurrent PPO.
  • Figure 4: Learning curves on one generalization in RL environment, Hopper-Generalize. We show the interpolation success rates (left figure) and extrapolation success rates (right figure) of each method. The single best variant of our implementation of recurrent model-free RL can be on par with the specialized method EPOpt-PPO-FF [rajeswaran2016epopt] in interpolation and outperform it in extrapolation. The data for EPOpt-PPO-FF and A2C-RC (a recurrent model-free on-policy RL method) are copied from Tables 7 and 8 in [packer2018assessing].
  • Figure 5: Learning curves on two temporal credit assignment environments. We show the returns for Delayed-Catch and the success rates of opening the door for Key-to-Door, following the practice of IMPALA+SR [raposo2021synthetic]. The single best variant of our implementation of recurrent model-free RL is much more sample-efficient than the specialized method IMPALA+SR (the horizontal lines show its performance at 2.5M and 4M steps, respectively).
  • ...and 17 more figures