Experimental Study on The Effect of Multi-step Deep Reinforcement Learning in POMDPs

Lingheng Meng, Rob Gorbet, Michael Burke, Dana Kulić

TL;DR

The study addresses the challenge of applying DRL methods designed for fully observable MDPs to POMDPs in robotics. It compares PPO, TD3, and SAC across MDP and POMDP environments and finds that, contrary to results on MDPs, PPO can outperform TD3 and SAC under partial observability. The authors show that adding multi-step bootstrapping to TD3 and SAC (MTD3, MSAC) improves robustness in POMDPs across tasks, and they examine the underlying factors through hypotheses about temporal information propagation and exploration. The results bear on algorithm selection for real-world robots with noisy sensors, and the authors propose directions for future research, including more principled POMDP benchmarks and methods that either bridge MDP and POMDP solvers or use multi-step information more effectively.

Abstract

Deep Reinforcement Learning (DRL) has made tremendous advances in both simulated and real-world robot control tasks in recent years. This is particularly the case for tasks that can be carefully engineered with a full state representation, and which can then be formulated as a Markov Decision Process (MDP). However, applying DRL strategies designed for MDPs to novel robot control tasks can be challenging, because the available observations may be a partial representation of the state, resulting in a Partially Observable Markov Decision Process (POMDP). This paper considers three popular DRL algorithms, namely Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC), invented for MDPs, and studies their performance in POMDP scenarios. While prior work has found that SAC and TD3 typically outperform PPO across a broad range of tasks that can be represented as MDPs, we show that this is not always the case, using three representative POMDP environments. Empirical studies show that this is related to multi-step bootstrapping, where multi-step immediate rewards, instead of one-step immediate reward, are used to calculate the target value estimation of an observation and action pair. We identify this by observing that the inclusion of multi-step bootstrapping in TD3 (MTD3) and SAC (MSAC) results in improved robustness in POMDP settings.
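To make the multi-step target described in the abstract concrete, here is a minimal sketch of how an $n$-step bootstrapped target could be computed. The function name `n_step_target`, its parameters, and the default discount are illustrative assumptions, not the authors' implementation. How MTD3 and MSAC obtain the bootstrapped value is not specified in this section, so it is passed in as a plain number.

```python
from typing import Sequence


def n_step_target(
    rewards: Sequence[float],
    bootstrap_value: float,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Hypothetical helper: n-step bootstrapped target.

    Computes sum_{k=0}^{n-1} gamma^k * rewards[k] + gamma^n * bootstrap_value,
    where n = len(rewards). If the episode terminated within the n steps
    (done=True), the bootstrapped value is dropped.
    """
    # Start from the value estimate at the end of the n-step window.
    target = 0.0 if done else bootstrap_value
    # Fold the rewards in from last to first, discounting once per step.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


# One-step target (n = 1): r_0 + gamma * V
one_step = n_step_target([1.0], bootstrap_value=10.0, gamma=0.5)

# Three-step target (n = 3): r_0 + gamma*r_1 + gamma^2*r_2 + gamma^3 * V
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With $n = 1$ this reduces to the usual one-step target. Larger $n$ puts more weight on observed rewards and less on the learned value estimate.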

Paper Structure

This paper contains 25 sections, 12 equations, 23 figures, and 8 tables.

Figures (23)

  • Figure 1: Performance of PPO, TD3 and SAC on Walker2D, where (a) shows the Walker2D robot, (b) shows the performance of the three DRL algorithms on the original task, and (c) shows their performance on the modified task with random noise in the task observation.
  • Figure 2: Information incorporated in $n$-step bootstrapping: the $n$ immediate rewards, followed by the bootstrapped value, whose calculation depends on which value function is available.
  • Figure 3: Benchmark Tasks, where (a) Ant-v2, (b) HalfCheetah-v2, (c) Hopper-v2, and (d) Walker2D-v2.
  • Figure 4: Bar Chart of Average Return Over Tasks in Table \ref{tab:Maximum_of_Average_Return}
  • Figure 5: Effect of Multi-step Size on The Performance of MTD3 and MSAC, where the average learning curves correspond to MTD3($n$) and MSAC($n$) with different multi-step sizes $n$, and the shaded area shows half a standard deviation of the average accumulated return over 3 random seeds.
  • ...and 18 more figures