Wasserstein Barycenter Soft Actor-Critic

Zahra Shahrooei, Ali Baheri

TL;DR

This paper proposes the Wasserstein Barycenter Soft Actor-Critic (WBSAC) algorithm, which uses a pessimistic actor for temporal difference learning and an optimistic actor to promote exploration. The Wasserstein barycenter of the pessimistic and optimistic policies serves as the exploration policy, and the degree of exploration is adjusted throughout the learning process.

Abstract

Deep off-policy actor-critic algorithms have emerged as the leading framework for reinforcement learning in continuous control domains. However, most of these algorithms suffer from poor sample efficiency, especially in environments with sparse rewards. In this paper, we take a step towards addressing this issue by providing a principled directed exploration strategy. We propose the Wasserstein Barycenter Soft Actor-Critic (WBSAC) algorithm, which uses a pessimistic actor for temporal difference learning and an optimistic actor to promote exploration. The exploration policy is the Wasserstein barycenter of the pessimistic and optimistic policies, and the degree of exploration is adjusted throughout the learning process. We compare WBSAC with state-of-the-art off-policy actor-critic algorithms and show that WBSAC is more sample-efficient on MuJoCo continuous control tasks.
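To make the barycenter idea concrete, here is a minimal sketch, assuming both policies output factorized (diagonal) Gaussians over unsquashed actions. In that case the 2-Wasserstein barycenter has a closed form: the mean and the per-dimension standard deviation are each a weighted average of the two policies' values. The function names and the weight `lam` are hypothetical, chosen for illustration; the paper's own parameterization and schedule for the degree of exploration are not reproduced here.

```python
import numpy as np


def gaussian_w2_barycenter(mu_p, sigma_p, mu_o, sigma_o, lam):
    """2-Wasserstein barycenter of two diagonal Gaussians.

    mu_p, sigma_p: mean and std of the pessimistic policy at one state.
    mu_o, sigma_o: mean and std of the optimistic policy at the same state.
    lam: weight in [0, 1] on the pessimistic policy (hypothetical name).
    Returns the mean and std of the barycenter, also a diagonal Gaussian.
    """
    mu_p, sigma_p = np.asarray(mu_p, float), np.asarray(sigma_p, float)
    mu_o, sigma_o = np.asarray(mu_o, float), np.asarray(sigma_o, float)
    # For diagonal Gaussians, both means and standard deviations
    # interpolate linearly along the W2 geodesic.
    mu_e = lam * mu_p + (1.0 - lam) * mu_o
    sigma_e = lam * sigma_p + (1.0 - lam) * sigma_o
    return mu_e, sigma_e


def diag_gaussian_entropy(sigma):
    """Differential entropy of a diagonal Gaussian with std vector sigma."""
    sigma = np.asarray(sigma, float)
    return float(np.sum(0.5 * np.log(2.0 * np.pi * np.e) + np.log(sigma)))


# Illustrative values only; not taken from the paper.
mu_e, sigma_e = gaussian_w2_barycenter(
    mu_p=[0.0, 1.0], sigma_p=[0.2, 0.4],
    mu_o=[1.0, -1.0], sigma_o=[0.6, 0.8],
    lam=0.5,
)
rng = np.random.default_rng(0)
action = rng.normal(mu_e, sigma_e)  # one exploratory action sample
```

Setting `lam = 1.0` recovers the pessimistic policy and `lam = 0.0` the optimistic one, so varying the weight over training is one simple way to picture adjusting the degree of exploration. Because the barycenter's standard deviation is a convex combination and the logarithm is concave, its entropy in this sketch is at least the same weighted average of the two policies' entropies; this is a general property of the closed form and is not claimed to be the bound stated in the paper.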

Paper Structure

This paper contains 10 sections, 1 theorem, 21 equations, 11 figures, 3 tables.

Key Result

Proposition 1

For factorized Gaussian pessimistic and optimistic policies, the exploration policy $\pi_e$ (derived from (eq:barycenter_mean_wbsac) and (eq:barycenter_cov_wbsac)) has its differential entropy, $H(\pi_e(s))$, lower-bounded for any state $s \in \mathcal{S}$ as:

Figures (11)

  • Figure 1: WBSAC uses Wasserstein barycenter of optimistic and pessimistic policies as the exploration policy.
  • Figure 3: Performance comparison on MuJoCo environments. WBSAC outperforms SAC and DARC in all tasks and OAC in three.
  • Figure 4: Performance comparison on DeepMind control suite tasks. WBSAC consistently outperforms SAC and DARC, and performs better than or comparably to OAC.
  • Figure 5: Average coverage over 3 seeds in the PointMaze Medium-v3 navigation task with sparse reward.
  • Figure 6: State visitation heatmaps for pessimistic (first row) and exploration policies (second row).
  • ...and 6 more figures

Theorems & Definitions (2)

  • Proposition 1
  • proof