DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning

Anthony Liang; Guy Tennenholtz; Chih-wei Hsu; Yinlam Chow; Erdem Bıyık; Craig Boutilier

TL;DR

DynaMITE-RL addresses RL tasks in which a latent context evolves across sessions by introducing the Dynamic Latent Contextual MDP (DLCMDP). It derives a variational objective and three training principles (session-based consistency, latent-dynamics conditioning, and session-reconstruction masking) that let the agent infer the changing latent context efficiently and adapt its policy. Empirical results across Gridworld, MuJoCo tasks, and assistive robotics show significant gains in sample efficiency and returns over strong meta-RL baselines, in both online and offline settings. The work advances Bayes-like planning under dynamic latent contexts and points to applications in domains where user preferences or system dynamics drift over time. The key mathematical foundations are the DLCMDP transition $T(s_{t+1}, m_{t+1} \mid s_t, a_t, m_t)$, in which latent changes are governed by $T_m(m' \mid m)$ and session terminations by $d_t$, and an ELBO-based objective that combines the $q_{\phi}$ and $p_{\theta}$ components.
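To make the latent dynamics concrete, the following is a minimal sketch of how a latent context might evolve in a DLCMDP. It covers only the latent part of the model: at each timestep a Bernoulli termination flag $d_t$ is drawn, and when it fires the next context is sampled from $T_m(m' \mid m)$; otherwise the context stays fixed. The function names, the tabular representation of $T_m$ as a nested list, and the constant switching probability `p_switch` are illustrative assumptions, not the paper's implementation. State and action dynamics are omitted.

```python
import random


def simulate_latent_contexts(horizon, transition, m0, p_switch, seed=0):
    """Sample a latent-context trajectory and session-termination flags.

    transition[m][m2] is an assumed tabular form of T_m(m2 | m).
    p_switch is an assumed constant Bernoulli parameter for d_t.
    """
    rng = random.Random(seed)
    contexts, terminations = [], []
    m = m0
    for _ in range(horizon):
        contexts.append(m)
        d = rng.random() < p_switch  # d_t ~ Bernoulli(p_switch)
        terminations.append(d)
        if d:
            # The session ended: draw the next context from T_m(. | m).
            probs = transition[m]
            m = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return contexts, terminations


def session_lengths(terminations):
    """Recover the session lengths l_i from the termination flags."""
    lengths, current = [], 0
    for d in terminations:
        current += 1
        if d:
            lengths.append(current)
            current = 0
    if current:
        # The final session is cut off by the horizon, not by a termination.
        lengths.append(current)
    return lengths


if __name__ == "__main__":
    T_m = [[0.1, 0.9], [0.9, 0.1]]  # two contexts that tend to alternate
    contexts, flags = simulate_latent_contexts(20, T_m, m0=0, p_switch=0.2)
    print(contexts)
    print(session_lengths(flags))
```

In this sketch the termination flags are returned so the session boundaries can be inspected. In the DLCMDP itself they are not observed by the agent, which is why the context has to be inferred.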

Abstract

We introduce DynaMITE-RL, a meta-reinforcement learning (meta-RL) approach to approximate inference in environments where the latent state evolves at varying rates. We model episode sessions, the parts of an episode in which the latent state is fixed, and propose three key modifications to existing meta-RL methods: consistency of latent information within sessions, session masking, and prior latent conditioning. We demonstrate the importance of these modifications in domains ranging from discrete Gridworld environments to continuous-control and simulated robot assistive tasks, and show that DynaMITE-RL significantly outperforms state-of-the-art baselines in sample efficiency and inference returns.

Paper Structure (19 sections, 16 equations, 8 figures, 5 tables, 3 algorithms)

Introduction
Background
Dynamic Latent Contextual MDPs
DynaMITE-RL
Experiments
Related Work
Conclusion
Appendix / supplemental material
ELBO Derivation for DLCMDP
Pseudocode for DynaMITE-RL
Additional Experimental Results
Evaluation Environment Description
Gridworld Navigation with Alternating Goals
MuJoCo Continuous Control
Assistive Gym
...and 4 more sections

Figures (8)

Figure 1: (Left) The graphical model for a DLCMDP. The transition dynamics of the environment follow $T(s_{t+1}, m_{t+1} \mid s_{t}, a_{t}, m_{t})$. At every timestep $t$, an i.i.d. Bernoulli random variable, $d_{t}$, denotes the change in the latent context, $m_{t}$. Blue-shaded variables are observed and white variables are latent. (Right) A DLCMDP rollout. Each session $i$ is governed by a latent variable $m^{i}$, which changes between sessions according to a fixed transition function, $T_m(m'\mid m)$. We denote by $l_{i}$ the length of session $i$. The state-action pair $(s_{t}^{i}, a_{t}^{i})$ at timestep $t$ in session $i$ is summarized into a single observed variable, $x_{t}^{i}$. We emphasize that session terminations are not explicitly observed.
Figure 2: VariBAD does not model the latent context dynamics and fails to adapt to the changing goal location. By contrast, DynaMITE-RL correctly infers the transition and consistently reaches the rewarding cell (green cross).
Figure 3: DynaMITE-RL Training
Figure 4: The environments considered in evaluating DynaMITE-RL. Each environment exhibits some change in reward and/or dynamics between sessions including changing goal locations (left and middle left), changing target velocities (middle right), and evolving user preferences of itch location (right).
Figure 5: Learning curves for DynaMITE-RL and state-of-the-art baseline methods. Shaded areas represent standard deviation over 5 different random seeds for each method and 3 for ScratchItch. In each of the evaluation environments, we observe that DynaMITE-RL exhibits better sample efficiency and converges to a policy with better environment returns than the baseline methods.
...and 3 more figures

Theorems & Definitions (1)

Remark 4.1
