DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing

Vint Lee, Pieter Abbeel, Youngwoon Lee

TL;DR

This work identifies reward prediction as a central bottleneck in model-based RL, especially in sparse-reward and partially observable environments. It introduces DreamSmooth, a simple method that replaces exact per-timestep rewards with temporally smoothed rewards, computed by applying a kernel over a window of timesteps, and trains the reward model on these smoothed signals. DreamSmooth is compatible with DreamerV3, TD-MPC, MBPO, and other backbones. It delivers state-of-the-art sample efficiency and final performance on long-horizon sparse-reward tasks without hurting performance on common benchmarks such as the DeepMind Control Suite (DMC) and Atari. Ablations show robustness to smoothing parameters and kernel choice, and EMA smoothing comes with a theoretical optimality guarantee via reward shaping. The approach improves reward prediction accuracy, which in turn improves planning and policy learning, and it requires minimal implementation overhead. In Crafter and other challenging settings, the results also expose trade-offs between smoothing-induced optimism and false positives, suggesting directions for refining kernel choices and smoothing strategies in future work.

Abstract

Model-based reinforcement learning (MBRL) has gained much attention for its ability to learn complex behaviors in a sample-efficient way: planning actions by generating imaginary trajectories with predicted rewards. Despite its success, we found that, surprisingly, reward prediction is often a bottleneck of MBRL, especially for sparse rewards that are challenging (or even ambiguous) to predict. Motivated by the intuition that humans can learn from rough reward estimates, we propose a simple yet effective reward smoothing approach, DreamSmooth, which learns to predict a temporally smoothed reward instead of the exact reward at the given timestep. We empirically show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks, both in sample efficiency and final performance, without losing performance on common benchmarks such as the DeepMind Control Suite and Atari.

Paper Structure

This paper contains 28 sections, 2 theorems, 15 equations, 21 figures, 3 tables, and 1 algorithm.

Key Result

Theorem A.1

An optimal policy $\tilde{\pi}^*$ of the MDP $\tilde{\mathcal{M}}=(\mathcal{S}, \mathcal{A}, P, \tilde{R}, \gamma)$, in which rewards are smoothed using only past rewards (e.g., EMA smoothing), is also optimal under the original MDP $\mathcal{M}$, where

Figures (21)

  • Figure 1: Predicting the exact sequence of rewards is extremely difficult. These examples show the sequences of image observations seen by the agent just before and after it receives a large reward. There is little to visually distinguish timesteps with a large reward from those without, which creates a significant challenge for reward prediction.
  • Figure 2: Ground truth rewards and DreamerV3's predicted rewards over an evaluation episode. The reward model misses many sparse rewards, which are highlighted in yellow.
  • Figure 3: The reward model's inability to predict sparse rewards for completing tasks leads to poor task performance. (a) In RoboDesk, the agent gets stuck after learning the first task, and is unable to learn to perform the subsequent tasks. (b) In Earthmoving, the policy often fails to dump the rocks accurately into the dumptruck. The learning curves are averaged over $3$ seeds.
  • Figure 4: Reward smoothing on sparse reward $1$ at $t=4$. $\sigma$, $\delta$, and $\alpha$ are smoothing hyperparameters.
  • Figure 5: We evaluate DreamSmooth on four tasks with sparse subtask completion rewards (a-d). We also test on two popular benchmarks, (e) DeepMind Control Suite and (f) Atari.
  • ...and 16 more figures
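
The setup in Figure 4 (a single sparse reward of $1$ at $t=4$) can be illustrated with a minimal sketch of three smoothing kernels. This is not the authors' implementation. The function names are hypothetical, and the exact kernel forms are assumptions: a normalized Gaussian with width $\sigma$, a centered uniform window of $\delta$ timesteps, and the recursion $\tilde{r}_t = \alpha r_t + (1-\alpha)\tilde{r}_{t-1}$ for EMA. Zero padding at the episode boundaries is also an assumption.

```python
import numpy as np

def _convolve_same(rewards, kernel):
    """Zero-pad so the output has the same length as `rewards`.

    Assumes an odd-length, symmetric kernel.
    """
    radius = len(kernel) // 2
    padded = np.pad(np.asarray(rewards, dtype=float), radius)
    return np.convolve(padded, kernel, mode="valid")

def gaussian_smooth(rewards, sigma):
    """Spread each reward over past and future steps with a Gaussian kernel."""
    radius = max(1, int(np.ceil(3 * sigma)))  # truncate at about 3 sigma (assumption)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so the total reward is preserved
    return _convolve_same(rewards, kernel)

def uniform_smooth(rewards, delta):
    """Average each reward over a centered window of `delta` timesteps."""
    if delta % 2 == 0:
        raise ValueError("this sketch assumes an odd window size")
    kernel = np.full(delta, 1.0 / delta)
    return _convolve_same(rewards, kernel)

def ema_smooth(rewards, alpha):
    """Exponential moving average; uses only current and past rewards."""
    rewards = np.asarray(rewards, dtype=float)
    smoothed = np.zeros_like(rewards)
    previous = 0.0  # assumed initial value before the episode starts
    for t, r in enumerate(rewards):
        previous = alpha * r + (1.0 - alpha) * previous
        smoothed[t] = previous
    return smoothed

# Sparse reward of 1 at t=4, as in Figure 4 (episode length 9 is arbitrary).
rewards = np.zeros(9)
rewards[4] = 1.0

print("gaussian:", np.round(gaussian_smooth(rewards, sigma=1.0), 3))
print("uniform: ", np.round(uniform_smooth(rewards, delta=3), 3))
print("ema:     ", np.round(ema_smooth(rewards, alpha=0.5), 3))
```

In this sketch the Gaussian and uniform kernels spread the reward to timesteps both before and after $t=4$, while EMA only carries it forward in time. That matches the "only with past rewards" condition under which Theorem A.1 applies.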

Theorems & Definitions (4)

  • Theorem A.1
  • proof
  • Theorem A.2
  • proof