DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning

Anthony Liang, Guy Tennenholtz, Chih-wei Hsu, Yinlam Chow, Erdem Bıyık, Craig Boutilier

TL;DR

DynaMITE-RL addresses RL tasks where latent context evolves across sessions by introducing the Dynamic Latent Contextual MDP (DLCMDP). It derives a variational objective and three key training principles—session-based consistency, latent-dynamics conditioning, and session-reconstruction masking—to efficiently infer changing latent context and adapt policies. Empirical results across Gridworld, MuJoCo tasks, and assistive robotics show significant gains in sample efficiency and returns over strong meta-RL baselines, in both online and offline settings. The work advances Bayes-like planning under dynamic latent contexts and points to broad applications in domains where user preferences or system dynamics drift over time. Key mathematical foundations include the DLCMDP transition $T(s_{t+1}, m_{t+1} \mid s_t, a_t, m_t)$ with latent changes governed by $T_m(m' \mid m)$ and session terminations $d_t$, and the ELBO-based objective integrating $q_{\phi}$ and $p_{\theta}$ components.

Abstract

We introduce DynaMITE-RL, a meta-reinforcement learning (meta-RL) approach to approximate inference in environments where the latent state evolves at varying rates. We model episode sessions (parts of the episode where the latent state is fixed) and propose three key modifications to existing meta-RL methods: consistency of latent information within sessions, session masking, and prior latent conditioning. We evaluate the importance of these modifications in domains ranging from discrete Gridworld environments to continuous-control and simulated robot assistive tasks, and show that DynaMITE-RL significantly outperforms state-of-the-art baselines in sample efficiency and inference returns.

Paper Structure (19 sections, 16 equations, 8 figures, 5 tables, 3 algorithms)


Figures (8)

  • Figure 1: (Left) The graphical model for a DLCMDP. The transition dynamics of the environment follow $T(s_{t+1}, m_{t+1} \mid s_{t}, a_{t}, m_{t})$. At every timestep $t$, an i.i.d. Bernoulli random variable, $d_{t}$, indicates whether the latent context, $m_{t}$, changes. Blue shaded variables are observed and white (unshaded) variables are latent. (Right) A DLCMDP rollout. Each session $i$ is governed by a latent variable $m^{i}$, which changes between sessions according to a fixed transition function, $T_m(m'\mid m)$. We denote by $l_{i}$ the length of session $i$. The state-action pair $(s_{t}^{i}, a_{t}^{i})$ at timestep $t$ in session $i$ is summarized into a single observed variable, $x_{t}^{i}$. We emphasize that session terminations are not explicitly observed.
  • Figure 2: VariBAD does not model the latent context dynamics and fails to adapt to the changing goal location. By contrast, DynaMITE-RL correctly infers the transition and consistently reaches the rewarding cell (green cross).
  • Figure 3: DynaMITE-RL Training
  • Figure 4: The environments considered in evaluating DynaMITE-RL. Each environment exhibits some change in reward and/or dynamics between sessions including changing goal locations (left and middle left), changing target velocities (middle right), and evolving user preferences of itch location (right).
  • Figure 5: Learning curves for DynaMITE-RL and state-of-the-art baseline methods. Shaded areas represent the standard deviation over 5 random seeds for each method (3 for ScratchItch). In each evaluation environment, DynaMITE-RL exhibits better sample efficiency and converges to a policy with higher environment returns than the baseline methods.
  • ...and 3 more figures
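
The generative process described in the Figure 1 caption can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes the joint transition $T(s_{t+1}, m_{t+1} \mid s_t, a_t, m_t)$ factorizes into a state step that depends on the current latent and a separate latent step that fires only when $d_t = 1$. The names `rollout_dlcmdp`, `toy_step`, `random_policy`, `GOALS`, and `p_switch`, as well as the five-cell line world, are hypothetical and chosen purely for illustration.

```python
import random


def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    u = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if u < cum:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def rollout_dlcmdp(T_m, p_switch, horizon, policy, step_fn, s0, m0, seed=0):
    """Simulate one episode of a toy DLCMDP (illustrative assumption, not the paper's code).

    T_m[m][m2] plays the role of T_m(m' | m); p_switch is the Bernoulli
    parameter of the session-termination variable d_t.
    """
    rng = random.Random(seed)
    s, m = s0, m0
    trajectory = []        # (s_t, a_t, r_t, m_t, d_t); m_t and d_t are hidden from the agent
    session_lengths = []   # l_i for each session i
    current_len = 0
    for _ in range(horizon):
        a = policy(s, rng)
        s_next, r = step_fn(s, a, m, rng)          # state and reward depend on the current latent
        current_len += 1
        d = 1 if rng.random() < p_switch else 0    # i.i.d. Bernoulli session termination
        trajectory.append((s, a, r, m, d))
        if d == 1:
            m = sample_categorical(T_m[m], rng)    # the latent changes only between sessions
            session_lengths.append(current_len)
            current_len = 0
        s = s_next
    if current_len > 0:
        session_lengths.append(current_len)        # last, possibly unfinished, session
    return trajectory, session_lengths


# Hypothetical toy task: five cells on a line; the latent selects the rewarding cell.
GOALS = [0, 4]


def toy_step(s, a, m, rng):
    s_next = min(4, max(0, s + a))
    r = 1.0 if s_next == GOALS[m] else 0.0
    return s_next, r


def random_policy(s, rng):
    return rng.choice([-1, 1])


T_m = [[0.2, 0.8], [0.8, 0.2]]
traj, lengths = rollout_dlcmdp(T_m, p_switch=0.1, horizon=50,
                               policy=random_policy, step_fn=toy_step, s0=2, m0=0)
```

The trajectory logs $m_t$ and $d_t$ only for inspection; in the setting described above, the agent sees the state-action stream and must infer both the latent context and the session boundaries.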

Theorems & Definitions (1)

  • Remark 4.1