Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, Hanlin Goh

TL;DR

The paper tackles a central failure mode of offline reinforcement learning, Bellman backups through out-of-distribution (OOD) state-action pairs, by introducing Uncertainty Weighted Actor-Critic (UWAC). UWAC uses Monte Carlo dropout to estimate predictive uncertainty and down-weights high-uncertainty Bellman backups, stabilizing training without extra models. Empirically, UWAC achieves state-of-the-art performance on standard offline-RL benchmarks (MuJoCo D4RL) and shows strong gains on Adroit hand tasks with sparse human demonstrations, largely due to improved Q-function stability. The approach preserves BEAR's data-support philosophy while addressing its instability on complex datasets, offering a practical, robust solution for offline RL.

Abstract

Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration. However, existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We hypothesize that a key missing ingredient from the existing methods is a proper treatment of uncertainty in the offline setting. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly. Implementation-wise, we adopt a practical and effective dropout-based uncertainty estimation method that introduces very little overhead over existing RL algorithms. Empirically, we observe that UWAC substantially improves model stability during training. In addition, UWAC outperforms existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts.
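The dropout-based weighting described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the linear Q-function, the inverse-variance form of the weight, and the names `q_forward`, `mc_dropout_stats`, `backup_weight`, `weighted_td_loss`, `beta`, and `max_weight` are all assumptions made for illustration. The abstract only states that uncertain backups are down-weighted, so the exact weighting rule here is one plausible choice.

```python
import random
import statistics


def q_forward(weights, features, drop_prob, rng):
    """Toy linear Q-value with inverted dropout on the inputs (assumed model)."""
    keep = 1.0 - drop_prob
    total = 0.0
    for w, x in zip(weights, features):
        # Keep each input with probability `keep`; rescale so the mean is unchanged.
        if rng.random() < keep:
            total += w * x / keep
    return total


def mc_dropout_stats(weights, features, drop_prob=0.1, passes=100, seed=0):
    """Monte Carlo dropout: run several stochastic passes, return (mean, variance)."""
    rng = random.Random(seed)
    samples = [q_forward(weights, features, drop_prob, rng) for _ in range(passes)]
    return statistics.fmean(samples), statistics.pvariance(samples)


def backup_weight(variance, beta=1.0, max_weight=1.5):
    """Assumed rule: weight inversely proportional to variance, capped for stability."""
    return min(beta / max(variance, 1e-8), max_weight)


def weighted_td_loss(q_values, targets, target_variances, beta=1.0):
    """Mean squared TD error, with each term scaled by its uncertainty weight."""
    terms = [
        backup_weight(var, beta) * (q - y) ** 2
        for q, y, var in zip(q_values, targets, target_variances)
    ]
    return sum(terms) / len(terms)


# Hypothetical usage: estimate the target uncertainty, then weight one backup.
WEIGHTS = [0.5, -1.0, 2.0]
target_mean, target_var = mc_dropout_stats(WEIGHTS, [1.0, 0.5, 0.25], drop_prob=0.5)
loss = weighted_td_loss([0.3], [target_mean], [target_var])
```

In this sketch a backup whose target has high dropout variance contributes little to the loss, while a low-variance target receives at most `max_weight`. The cap is a design choice added here so that near-zero variances do not dominate the objective.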

Paper Structure

This paper contains 21 sections, 2 theorems, 13 equations, 17 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1.1

Suppose we run approximate distribution-constrained value iteration with a set-constrained backup $\mathcal{T}^\Pi$ on a set of policies $\Pi$. Let $\delta(s,a)$ be the upper bound on the Bellman approximation error for a given state-action pair $(s,a)$ over $k$ training steps: $\delta(s,a) = \sup_k \left|Q^k(s,a) - \mathcal{T}^\Pi Q^{k-1}(s,a)\right|$. [...] with the suboptimality constant $\alpha(\Pi)$ and the concentrability coefficient defined as:

Figures (17)

  • Figure 1: Left. Plot of average return vs. training epochs for our proposed method (red) and the BEAR baseline (brown) on the relocate-expert offline dataset. Right. Corresponding plot of Q-target values vs. training epochs. Our proposed method achieves much higher average return, with better training stability and more controlled Q-values.
  • Figure 2: Expert Trajectory Visualization. 2D heat maps of the expert's action distribution with respect to horizontal/vertical displacement from the goal location. Warmer locations represent more observations.
  • Figure 3: Top. The training set with horizontal displacements ($<0.1$) removed. This makes all states on the left OOD. Bottom. Our model estimates higher uncertainty (brighter color) on the left and lower uncertainty (colder color) on the right. We visualize the heatmap with the average speed of the lander, which is faster than observations at the bottom of the map. As a result, Fig. \ref{fig:lunarlander_original} does not represent the actual frequency of training data, and the uncertainty should be compared horizontally, not vertically.
  • Figure 4: Uncertainty (estimated as variance) of state-action pairs from the (walker2d-expert) training dataset (green) compared to uncertainty estimates of the states combined with random actions from the same dataset. Since the action space for robotic control is quite small and noisy, a lot of random actions are actually in-distribution. Although the regions overlap, we achieve a ROC/AUC score of 0.845 for identifying OOD actions.
  • Figure 5: Our learned policies successfully accomplish manipulation tasks, such as opening a door as shown.
  • ...and 12 more figures
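
The ROC/AUC figure quoted in the Figure 4 caption measures how well the variance estimate separates dataset actions from random ones. As a rough sketch of how such a score can be computed from two lists of uncertainty values, the pairwise (Mann-Whitney) formulation below may help. The function name `roc_auc` and the example scores are hypothetical and are not taken from the paper.

```python
def roc_auc(in_dist_scores, ood_scores):
    """Fraction of (OOD, in-distribution) pairs where the OOD score is higher.

    Ties count as half a win, which matches the usual ROC AUC definition
    when the score is used directly as the OOD detector.
    """
    if not in_dist_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for ood in ood_scores:
        for ind in in_dist_scores:
            if ood > ind:
                wins += 1.0
            elif ood == ind:
                wins += 0.5
    return wins / (len(ood_scores) * len(in_dist_scores))


# Hypothetical variance estimates: dataset actions vs. random actions.
dataset_variances = [0.10, 0.20, 0.35]
random_action_variances = [0.30, 0.40, 0.15]
auc = roc_auc(dataset_variances, random_action_variances)
```

A value of 0.5 means the uncertainty carries no information about whether an action is OOD, and 1.0 means perfect separation. Overlapping score distributions, as the caption describes, land somewhere in between.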

Theorems & Definitions (3)

  • Theorem 1.1
  • Theorem 1.2
  • proof