Reinforcement Learning from Delayed Observations via World Models

Armin Karamzade, Kyungmin Kim, Montek Kalsi, Roy Fox

TL;DR

This paper proposes using world models, which have proven effective at integrating past observations and learning dynamics, to handle observation delays in partially observable environments. The key step is to use the world model to reduce a delayed POMDP to a delayed MDP.

Abstract

In standard reinforcement learning settings, agents typically assume immediate feedback about the effects of their actions after taking them. However, in practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of learning algorithms. In this paper, we address observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to 250%. Moreover, we evaluate our methods on visual delayed environments, for the first time showcasing delay-aware reinforcement learning continuous control with visual observations.
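To make the delayed-observation setting concrete, the following is a minimal sketch and not the paper's method: it uses no world model or Dreamer component. It wraps a toy environment so that each observation reaches the agent `d` steps late, and returns the pair (delayed observation, actions taken since then), which is a common way to define the extended state of a delayed MDP. The names `ToyEnv`, `DelayedObservation`, and `extended_state` are hypothetical and chosen only for illustration.

```python
from collections import deque


class ToyEnv:
    """Hypothetical 1-D environment: the state is an integer position."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # The action is -1 or +1; the observation is the new position.
        self.pos += action
        return self.pos


class DelayedObservation:
    """Wrap an environment so that observations arrive d steps late."""

    def __init__(self, env, d):
        self.env = env
        self.d = d

    def reset(self):
        first = self.env.reset()
        # Assumption: until d steps have passed, the agent keeps seeing
        # the initial observation.
        self.obs_buffer = deque([first] * (self.d + 1), maxlen=self.d + 1)
        # Actions taken after the observation currently visible to the agent.
        self.action_buffer = deque(maxlen=self.d)
        return self.extended_state()

    def step(self, action):
        self.obs_buffer.append(self.env.step(action))
        self.action_buffer.append(action)
        return self.extended_state()

    def extended_state(self):
        # Oldest buffered observation plus the actions taken after it.
        return self.obs_buffer[0], tuple(self.action_buffer)


env = DelayedObservation(ToyEnv(), d=2)
print(env.reset())
for a in (1, 1, -1):
    print(env.step(a))
```

In this toy case the current position can be recovered exactly by adding the buffered actions to the delayed observation. With partial observability and unknown dynamics that shortcut is unavailable, which is the gap the abstract says a learned world model is meant to fill.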

Paper Structure (29 sections, 1 theorem, 6 equations, 7 figures, 3 tables)

This paper contains 29 sections, 1 theorem, 6 equations, 7 figures, 3 tables.

Key Result

Proposition 3

If a world model $\widehat{M}$ is congruent with a POMDP $\mathcal{P}_o$, then the $d$-step delayed world model $\widehat{M}^d$ is congruent with the $d$-step delayed DPOMDP $\mathcal{P}_o^d$.

Figures (7)

  • Figure 1: Panels (\ref{fig:world-model}) and (\ref{fig:actor-critic}) depict the standard Dreamer learning process, while (\ref{fig:extended-diagram}) and (\ref{fig:latent-diagram}) illustrate two strategies for adapting Dreamer to observation delays (see Sections \ref{sec:delay-aware} and \ref{sec:delayed-ac}).
  • Figure 2:
  • Figure 3: Normalized returns across different environments for varying delays. Bars and caps represent the mean and standard error of the mean over 5 trials, respectively. Panels (\ref{fig:dmc-proprio-bars}) and (\ref{fig:dmc-vision-bars}) are averaged over the selected suites in DMC, with returns normalized so that the agent in the undelayed environment scores 1 and the random policy scores 0.
  • Figure 4: Return against the degree of observability in HalfCheetah-v4 for $d=5$.
  • Figure 5: Training curves for the set of tasks in Gym. Dreamer variants were trained with 500K environment interactions, while D-TRPO and DC-AC used 5M and 1M interactions, respectively. For D-TRPO and DC-AC, we plot only the final training performance.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Proposition 3
  • proof