DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning
Anthony Liang, Guy Tennenholtz, Chih-wei Hsu, Yinlam Chow, Erdem Bıyık, Craig Boutilier
TL;DR
DynaMITE-RL addresses RL tasks in which the latent context evolves across sessions by introducing the Dynamic Latent Contextual MDP (DLCMDP). It derives a variational objective and three key training principles (session-based consistency, latent-dynamics conditioning, and session-reconstruction masking) to efficiently infer the changing latent context and adapt policies. Empirical results across Gridworld, MuJoCo tasks, and assistive robotics show significant gains in sample efficiency and returns over strong meta-RL baselines, in both online and offline settings. The work advances Bayes-like planning under dynamic latent contexts and points to applications in domains where user preferences or system dynamics drift over time. The formalism rests on the DLCMDP transition $T(s_{t+1}, m_{t+1} \mid s_t, a_t, m_t)$, in which the latent context $m_t$ changes according to $T_m(m' \mid m)$ at session terminations $d_t$, and on an ELBO-based objective that integrates the $q_{\phi}$ and $p_{\theta}$ components.
Abstract
We introduce DynaMITE-RL, a meta-reinforcement learning (meta-RL) approach to approximate inference in environments where the latent state evolves at varying rates. We model episode sessions (parts of the episode where the latent state is fixed) and propose three key modifications to existing meta-RL methods: consistency of latent information within sessions, session masking, and prior latent conditioning. We demonstrate the importance of these modifications in various domains, ranging from discrete Gridworld environments to continuous-control and simulated robot assistive tasks, and show that DynaMITE-RL significantly outperforms state-of-the-art baselines in sample efficiency and inference returns.
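To make the notion of a session concrete, the following is a minimal toy sketch of the latent process only, not the authors' implementation. It assumes a discrete latent $m_t$, a constant per-step session-termination probability `p_end`, and a transition table `T_m` in a dictionary format; all of these, along with the function names `simulate_latents` and `split_sessions`, are hypothetical choices made for illustration. The latent is held fixed within a session and is resampled from $T_m(m' \mid m)$ only when $d_t = 1$.

```python
import random


def simulate_latents(T_m, m0, horizon, p_end, rng):
    """Roll out latent contexts m_t and session-termination flags d_t.

    T_m   : dict mapping a latent m to a list of (m_next, prob) pairs
            (hypothetical toy format for the latent transition T_m(m' | m)).
    p_end : probability that the current session terminates at each step
            (assumed constant here purely for simplicity).
    """
    m, latents, dones = m0, [], []
    for _ in range(horizon):
        latents.append(m)
        d = rng.random() < p_end          # session-termination indicator d_t
        dones.append(d)
        if d:
            # The latent may change only at a session boundary.
            nexts, probs = zip(*T_m[m])
            m = rng.choices(nexts, weights=probs, k=1)[0]
    return latents, dones


def split_sessions(latents, dones):
    """Group a latent trajectory into sessions using the d_t flags."""
    sessions, current = [], []
    for m, d in zip(latents, dones):
        current.append(m)
        if d:
            sessions.append(current)
            current = []
    if current:                            # trailing, unfinished session
        sessions.append(current)
    return sessions


# Example usage with a two-valued latent (illustrative numbers only).
toy_T_m = {0: [(0, 0.2), (1, 0.8)], 1: [(0, 0.8), (1, 0.2)]}
latents, dones = simulate_latents(toy_T_m, m0=0, horizon=50,
                                  p_end=0.1, rng=random.Random(0))
for session in split_sessions(latents, dones):
    print(len(session), "steps with latent", session[0])
```

Each printed line corresponds to one session, within which the latent is constant by construction. This within-session constancy is the structure that the consistency and masking modifications described above are meant to exploit.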
