Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling

Jasmine Bayrooti, Carl Henrik Ek, Amanda Prorok

TL;DR

The paper addresses sample inefficiency in robotic reinforcement learning by introducing HOT-GP, a practical model-based method that uses a joint reward-state Gaussian Process to capture correlations between transitions and rewards. Optimistic exploration is achieved via Thompson sampling conditioned on high-reward outcomes, with a Kronecker-structured covariance and an MLP mean to model complex dynamics and rewards. Empirically, HOT-GP matches or surpasses strong baselines on MuJoCo and VMAS tasks, especially in sparse-reward and action-penalized settings, and ablations reveal the critical role of joint uncertainty modeling and the sampling scheme. The work demonstrates that explicit joint uncertainty modeling can substantially enhance exploration efficiency, with potential for broader application in real-world robotics.
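The idea of sampling conditioned on high-reward outcomes can be illustrated with a minimal sketch, assuming the model's joint prediction over (next state, reward) at a single state-action pair is one multivariate Gaussian. The function name `optimistic_joint_sample`, the default threshold `r_min` (the predicted reward mean), and the rejection loop are all hypothetical choices made for illustration; the paper's actual GP model, threshold, and sampling procedure may differ.

```python
import numpy as np

def optimistic_joint_sample(mean, cov, rng, r_min=None, max_tries=1000):
    """Draw (next_state, reward) from a joint Gaussian, keeping rewards >= r_min.

    mean: shape (d + 1,), last entry is the predicted reward mean.
    cov:  shape (d + 1, d + 1), joint covariance over [next_state, reward].
    """
    mu_s, mu_r = mean[:-1], mean[-1]
    S_ss = cov[:-1, :-1]   # state-state block
    S_sr = cov[:-1, -1]    # state-reward cross-covariance
    S_rr = cov[-1, -1]     # reward variance
    if r_min is None:
        r_min = mu_r       # assumption: "high reward" means at least the mean

    # Step 1: sample the reward from its marginal, rejecting low draws.
    for _ in range(max_tries):
        r = rng.normal(mu_r, np.sqrt(S_rr))
        if r >= r_min:
            break
    else:
        r = r_min          # fallback when the threshold lies far in the tail

    # Step 2: sample the next state from the Gaussian conditional p(s' | r).
    cond_mean = mu_s + S_sr * (r - mu_r) / S_rr
    cond_cov = S_ss - np.outer(S_sr, S_sr) / S_rr
    s_next = rng.multivariate_normal(cond_mean, cond_cov)
    return s_next, r

# Usage with made-up numbers: a 2-D state whose first coordinate is
# strongly correlated with the reward.
rng = np.random.default_rng(0)
mean = np.array([0.0, 0.0, 1.0])
cov = np.array([[1.0, 0.2, 0.8],
                [0.2, 1.0, 0.1],
                [0.8, 0.1, 1.0]])
s_next, reward = optimistic_joint_sample(mean, cov, rng)
```

Because the state is drawn conditionally on the sampled reward, the optimism carries over to the transition only as far as the joint covariance links the two. With a zero cross-covariance the state sample would be unaffected by the reward threshold, which is one way to see why modeling the reward and state jointly matters here.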

Abstract

Learning complex robot behavior through interactions with the environment necessitates principled exploration. Effective strategies should prioritize exploring regions of the state-action space that maximize rewards, with optimistic exploration emerging as a promising direction aligned with this idea and enabling sample-efficient reinforcement learning. However, existing methods overlook a crucial aspect: the need for optimism to be informed by a belief connecting the reward and state. To address this, we propose a practical, theoretically grounded approach to optimistic exploration based on Thompson sampling. Our model structure is the first that allows for reasoning about joint uncertainty over transitions and rewards. We apply our method on a set of MuJoCo and VMAS continuous control tasks. Our experiments demonstrate that optimistic exploration significantly accelerates learning in environments with sparse rewards, action penalties, and difficult-to-explore regions. Furthermore, we provide insights into when optimism is beneficial and emphasize the critical role of model uncertainty in guiding exploration.

Paper Structure (23 sections, 15 equations, 7 figures, 2 tables, 1 algorithm)

Figures (7)

  • Figure 1: Learning curves for all MuJoCo tasks were averaged over 10 seeds except for the Sparse Reacher task, which used 5 seeds. HOT-GP demonstrates equivalent or superior sample efficiency and performance for all tasks considered. The dashed line denotes SAC performance at convergence within 1,000,000 environment steps (3,000,000 for Half-Cheetah).
  • Figure 2: Learning curves on sparse maze tasks averaged over 10 seeds on the U Maze and 5 seeds on the Medium Maze. HOT-GP achieves equivalent or superior sample efficiency and performance on both tasks. The dashed line denotes SAC performance after 2,000,000 environment steps.
  • Figure 3: Learning curves in the coverage environment averaged over 10 seeds. HOT-GP achieves strong performance within 200,000 environment steps while other methods exhibit poorer sample efficiency or asymptotic performance. The dashed line denotes DDPG performance at convergence within 500,000 environment steps.
  • Figure 4: Frames from a rollout of a HOT-GP trained policy in the coverage environment. The agent (purple circle) drives around to cover target areas which are brightly colored. The target locations change on each episode.
  • Figure 5: Visualizations of the U Maze (left) and Medium Maze (right) environments with randomly selected goal locations.
  • ...and 2 more figures