Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning

Chenjia Bai, Peng Liu, Kaiyu Liu, Lingxiao Wang, Yingnan Zhao, Lei Han

TL;DR

This work tackles exploration in reinforcement learning under sparse extrinsic rewards by introducing the Variational Dynamic Model (VDM), a conditional generative model p(s'|s,a,z) that captures multimodality and stochasticity through a latent variable z, which is drawn from a learnable prior p(z|s,a) and inferred by a posterior q(z|s,a,s'). The model is trained by maximizing the variational lower bound L_VDM = E_{q}[log p_theta(s'|s,a,z)] - D_KL[q||p], and the agent's intrinsic reward is an upper-bound estimate of -log p(s'|s,a) computed from sampled latents, which guides self-supervised exploration. Empirical results across Atari, sticky-Atari, Super Mario, two-player Pong, and a real robotic task show that VDM improves exploration efficiency and robustness, outperforming ICM, RFM, and Disagreement, with a notable advantage in multimodal environments. The work includes theoretical and empirical comparisons to CVAE, which indicate that conditioning the prior on (s,a) yields a tighter bound and better dynamics modeling. The findings suggest that variational dynamics with intrinsic rewards can enable scalable, self-supervised exploration in complex, real-world RL settings.
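
Why the negated objective upper-bounds the negative log-likelihood can be seen from the standard evidence-lower-bound argument, written here in the notation of the summary above. This is a sketch of the generic derivation (Jensen's inequality applied to the latent-variable marginal), not a reproduction of the paper's Theorem 1, whose sampled estimator may take a different form.

```latex
\begin{aligned}
\log p(s' \mid s, a)
  &= \log \int p_\theta(s' \mid s, a, z)\, p(z \mid s, a)\, dz \\
  &= \log \mathbb{E}_{q(z \mid s, a, s')}\!\left[
       \frac{p_\theta(s' \mid s, a, z)\, p(z \mid s, a)}{q(z \mid s, a, s')}
     \right] \\
  &\geq \mathbb{E}_{q(z \mid s, a, s')}\!\left[ \log p_\theta(s' \mid s, a, z) \right]
     - D_{\mathrm{KL}}\!\left[ q(z \mid s, a, s') \,\|\, p(z \mid s, a) \right]
   = L_{\rm VDM},
\\[4pt]
-\log p(s' \mid s, a) &\leq -L_{\rm VDM}.
\end{aligned}
```

Transitions that the model explains poorly have a large $-L_{\rm VDM}$, which is why a quantity of this kind can serve as a novelty signal.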

Abstract

Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from environments are sparse or even totally disregarded. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to model the multimodality and stochasticity. We consider the environmental state-action transition as a conditional generative process by generating the next-state prediction under the condition of the current state, action, and latent variable, which provides a better understanding of the dynamics and leads to better performance in exploration. We derive an upper bound of the negative log-likelihood of the environmental transition and use such an upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment model-based exploration approaches.
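
To make the "upper bound as intrinsic reward" idea concrete, the following is a minimal NumPy sketch under stated assumptions: the posterior, prior, and generative distributions are diagonal Gaussians (as the architecture figure caption indicates), and their parameters are supplied by the caller. The function names (`diag_gaussian_kl`, `diag_gaussian_logpdf`, `negative_elbo`) and the `decode` callable are hypothetical and chosen for illustration; the reward actually used in the paper may be estimated differently.

```python
import numpy as np


def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL[N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p)))]."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )


def diag_gaussian_logpdf(x, mu, logvar):
    """Log-density of x under a diagonal Gaussian with the given parameters."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + logvar + (x - mu) ** 2 / np.exp(logvar)
    )


def negative_elbo(next_feat, posterior, prior, decode, num_samples=10, rng=None):
    """Monte Carlo estimate of KL[q || p] - E_q[log p(s' | s, a, z)].

    posterior, prior: (mean, log-variance) pairs for q(z|s,a,s') and p(z|s,a).
    decode: hypothetical callable mapping a latent z to the (mean, log-variance)
            of the next-state feature distribution for a fixed (s, a).
    The returned value upper-bounds -log p(s'|s,a) in expectation, so it can
    be read as an intrinsic-reward candidate (an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    mu_q, logvar_q = posterior
    mu_p, logvar_p = prior

    recon = 0.0
    for _ in range(num_samples):
        # Reparameterized sample z = mu + sigma * eps from the posterior.
        eps = rng.standard_normal(mu_q.shape)
        z = mu_q + np.exp(0.5 * logvar_q) * eps
        mu_x, logvar_x = decode(z)
        recon += diag_gaussian_logpdf(next_feat, mu_x, logvar_x)
    recon /= num_samples

    kl = diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return kl - recon
```

In a training loop, the same quantity would be minimized with respect to the network parameters, while its per-transition value would be handed to the policy as the exploration bonus. The sketch leaves out the networks themselves and any reward normalization.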

Paper Structure

This paper contains 27 sections, 1 theorem, 25 equations, 16 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

It holds for all positive integers $m \leq k$ that Moreover, if $w_i$ is bounded, then $r^{i}_{k}$ converges to $r^{i}$ as $k\to+\infty$.

Figures (16)

  • Figure 1: An intuitive example for latent variables in dynamics. We model the multimodality and stochasticity of the dynamics explicitly through latent variables (i.e., $z_1$ and $z_2$) for exploration.
  • Figure 2: MDP of the 'Noisy-Mnist'. The state of digit '0' always moves to digit '1', and the state of digit '1' moves to other digits with equal probability. The content in the circle represents the latent variables of Noisy-Mnist.
  • Figure 3: VDM architecture. The model contains a posterior network, a prior network and a generative network. The diagonal Gaussian is used as the output of each network. The objective function $L_{\rm VDM}$ contains KL-divergence and reconstruction loss. A random CNN is used for feature extraction.
  • Figure 4: Result of the probabilistic-ensemble dynamic model in 'Noisy-Mnist'. (a) When we input an image of the digit '0', three images are generated from different models. The models all predict the correct image class but lack diversity in writing styles. (b) When we input an image of the digit '1', the ensemble-based model tends to average the various reasonable predictions and generates blurred images.
  • Figure 5: Result of VDM in 'Noisy-Mnist'. (a) When we input an image of digit '0', we sample 10 latent variables $\{\mathbf{z_1},...,\mathbf{z_{10}}\}$ and generate a next-state prediction for each one. VDM generates digit '1' with different writing styles. (b) When we input an image of digit '1', we sample 100 latent variables $\{\mathbf{z_1},...,\mathbf{z_{100}}\}$ and 100 next-state predictions are generated. The digits '2' to '9' are sampled with almost equal probability in various writing styles.
  • ...and 11 more figures
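
The 'Noisy-Mnist' transition rule described in the Figure 2 and Figure 5 captions can be sketched at the level of class labels. This is an illustrative simplification: the function name `noisy_mnist_next_label` is hypothetical, "other digits" is taken to mean '2' to '9' as in the Figure 5 caption, and the sketch ignores the image level, where each label would additionally be rendered in a randomly chosen writing style.

```python
import random


def noisy_mnist_next_label(label, rng):
    """Sample the next digit class under the Noisy-Mnist rule (label level only).

    Digit 0 moves deterministically to digit 1; digit 1 moves to one of the
    digits 2..9 with equal probability. Other start states are not described
    in the captions, so they are rejected here.
    """
    if label == 0:
        return 1
    if label == 1:
        return rng.randint(2, 9)  # inclusive on both ends
    raise ValueError("transition is only described for digits 0 and 1")


if __name__ == "__main__":
    rng = random.Random(0)
    print([noisy_mnist_next_label(1, rng) for _ in range(10)])
```

The first branch is a unimodal transition (one class, many writing styles), while the second is multimodal, which is the case where a model that averages over outcomes produces blurred predictions.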

Theorems & Definitions (2)

  • Theorem 1
  • proof