DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing
Vint Lee, Pieter Abbeel, Youngwoon Lee
TL;DR
This work identifies reward prediction as a central bottleneck in model-based RL, especially in sparse-reward and partially observable environments. It introduces DreamSmooth, a simple method that replaces exact per-timestep rewards with temporally-smoothed rewards, computed by applying a kernel over a window of timesteps, and trains the reward model on these smoothed signals. DreamSmooth is compatible with DreamerV3, TD-MPC, MBPO, and other backbones. It delivers state-of-the-art sample efficiency and final performance on long-horizon sparse-reward tasks without hurting performance on dense benchmarks such as DMC and Atari. Ablations show robustness to smoothing parameters and kernel choice, and EMA smoothing comes with a theoretical optimality guarantee via reward shaping. The approach improves reward prediction accuracy, enhances planning and policy learning, and requires minimal implementation overhead, which makes it practical for sparse-reward RL. In Crafter and other challenging settings, the results also illuminate trade-offs between smoothing-induced optimism and false positives, suggesting directions for refining kernel choices and smoothing strategies in future work.
Abstract
Model-based reinforcement learning (MBRL) has gained much attention for its ability to learn complex behaviors in a sample-efficient way: planning actions by generating imaginary trajectories with predicted rewards. Despite its success, we found that, surprisingly, reward prediction is often a bottleneck of MBRL, especially for sparse rewards that are challenging (or even ambiguous) to predict. Motivated by the intuition that humans can learn from rough reward estimates, we propose a simple yet effective reward smoothing approach, DreamSmooth, which learns to predict a temporally-smoothed reward instead of the exact reward at the given timestep. We empirically show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks, in both sample efficiency and final performance, without losing performance on common benchmarks such as the DeepMind Control Suite and Atari.
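To make the idea of a temporally-smoothed reward concrete, the following is a minimal sketch of how the rewards of one episode might be smoothed before they are used as reward-model targets. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gaussian_smooth`, `ema_smooth`), the parameters `sigma` and `alpha`, the zero padding at episode boundaries, and the particular EMA recurrence are all choices made here for the example.

```python
import numpy as np


def gaussian_smooth(rewards, sigma=3.0):
    """Smooth one episode's rewards with a normalized Gaussian kernel.

    Assumption: timesteps outside the episode are treated as zero reward.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    radius = int(np.ceil(3 * sigma))  # truncate the kernel at 3 sigma
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # weights sum to 1
    padded = np.pad(rewards, radius, mode="constant")
    # "valid" on the padded signal returns exactly len(rewards) values.
    return np.convolve(padded, kernel, mode="valid")


def ema_smooth(rewards, alpha=0.5):
    """Exponential moving average; spreads each reward to later timesteps only.

    Assumed recurrence: s[t] = alpha * s[t-1] + (1 - alpha) * r[t], with s[-1] = 0.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    smoothed = np.empty_like(rewards)
    prev = 0.0
    for t, r in enumerate(rewards):
        prev = alpha * prev + (1.0 - alpha) * r
        smoothed[t] = prev
    return smoothed


# A sparse-reward episode: a single reward of 1 at timestep 5.
sparse = np.zeros(11)
sparse[5] = 1.0
gauss_targets = gaussian_smooth(sparse, sigma=1.0)
ema_targets = ema_smooth([0.0, 1.0, 0.0, 0.0], alpha=0.5)
```

In this sketch the Gaussian kernel spreads the single sparse reward symmetrically over neighboring timesteps, so a reward model that predicts the reward a step early or late is penalized less than it would be against the exact target. The EMA variant only carries reward forward in time.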
