Dense Dynamics-Aware Reward Synthesis: Integrating Prior Experience with Demonstrations

Cevahir Koprulu, Po-han Li, Tianyu Qiu, Ruihan Zhao, Tyler Westenbroek, David Fridovich-Keil, Sandeep Chinchali, Ufuk Topcu

TL;DR

This work tackles long-horizon sparse-reward reinforcement learning by marrying a task-agnostic prior dataset with a small set of task demonstrations through a potential-based reward shaping framework. The method learns a goal-conditioned value estimator $\tilde{V}_g$ from prior data and constructs a global potential $\Phi(s)$ by combining $\tilde{V}_g$ with demonstrations via $\Phi(s) = \max_j \max_{s_t^j \in \Delta(s)} [V_d^j(s_t^j) + \tilde{V}_g(s; s_t^j)]$, then defines a dense reward $\bar{r}(s_t,a_t) = r(s_t,a_t) + \gamma \Phi(s_{t+1}) - \Phi(s_t)$. This PBRS-based approach biases exploration toward demonstration trajectories while preserving the original optimal policies, and it can be enhanced by annealing the discount factor to promote early reliance on the prior. The method demonstrates substantial speedups in learning and robustness to suboptimal demonstrations and dynamics shifts, with potential applicability to partial observations and richer sensory inputs. Overall, the framework provides a principled way to fuse offline priors and online demonstrations to improve sample efficiency in challenging long-horizon control tasks.
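The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `goal_value` and `demo_values` are hypothetical stand-ins for the learned estimator $\tilde{V}_g$ and the demonstration values $V_d^j$, the distance-threshold test for membership in $\Delta(s)$ is an assumption, and so is the fallback of a zero potential when $\Delta(s)$ is empty. States are points on a 1-D line purely for readability.

```python
from typing import List, Optional, Sequence

GAMMA = 0.9  # assumed discount factor


def goal_value(s: float, g: float) -> float:
    """Hypothetical stand-in for the learned estimator V~_g(s; g).

    Lies in (0, 1] and equals 1 only when s == g.
    """
    return GAMMA ** abs(s - g)


def demo_values(demo: Sequence[float]) -> List[float]:
    """Hypothetical V_d^j: discounted by the steps left in the demonstration."""
    last = len(demo) - 1
    return [GAMMA ** (last - t) for t in range(len(demo))]


def potential(s: float, demos: Sequence[Sequence[float]], radius: float = 2.0) -> float:
    """Phi(s) = max_j max_{s_t^j in Delta(s)} [V_d^j(s_t^j) + V~_g(s; s_t^j)]."""
    best: Optional[float] = None
    for demo in demos:
        for s_d, v_d in zip(demo, demo_values(demo)):
            # Assumed membership test for Delta(s): demonstrated states near s.
            if abs(s - s_d) <= radius:
                candidate = v_d + goal_value(s, s_d)
                if best is None or candidate > best:
                    best = candidate
    # Assumed fallback when no demonstrated state is in Delta(s).
    return 0.0 if best is None else best


def shaped_reward(r: float, s: float, s_next: float,
                  demos: Sequence[Sequence[float]]) -> float:
    """Dense reward r_bar = r + gamma * Phi(s') - Phi(s)."""
    return r + GAMMA * potential(s_next, demos) - potential(s, demos)
```

Along any trajectory the discounted shaping terms telescope to $\gamma^T \Phi(s_T) - \Phi(s_0)$, which is the standard reason potential-based shaping leaves the optimal policies unchanged.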

Abstract

Many continuous control problems can be formulated as sparse-reward reinforcement learning (RL) tasks. In principle, online RL methods can automatically explore the state space to solve each new task. However, discovering sequences of actions that lead to a non-zero reward becomes exponentially more difficult as the task horizon increases. Manually shaping rewards can accelerate learning for a fixed task, but it is an arduous process that must be repeated for each new environment. We introduce a systematic reward-shaping framework that distills the information contained in 1) a task-agnostic prior data set and 2) a small number of task-specific expert demonstrations, and then uses these priors to synthesize dense dynamics-aware rewards for the given task. This supervision substantially accelerates learning in our experiments, and we provide analysis demonstrating how the approach can effectively guide online learning agents to faraway goals.

Paper Structure

This paper contains 18 sections, 1 theorem, 11 equations, 4 figures, 2 tables.

Key Result

Proposition 1

Assume each state $s \in \mathcal{S}$, in-distribution goal $s_d \in \Delta(s)$, and value estimator $\tilde{V}_g$ satisfies the following conditions for some $\epsilon>0$: 1. $\tilde{V}_g(s;s_d) \in [0,1]$ and $\tilde{V}_g(s;s_d) =1$ iff $s =s_d$; and, 2. $\max_a\tilde{V}_g(f(s,a);s_d) > (1+\epsilo

Figures (4)

  • Figure 1: A depiction of our framework. (Cyan) We distill information about the dynamics of a prior data set into a goal-conditioned value estimator $\tilde{V}_g$. (Red) An expert provides demonstrations for a new target task. (Purple) The framework combines them to construct a potential function $\Phi(s)$ (\ref{eq:sinle_pot}), implicitly estimating the number of steps needed to reach the task-specific goal from state $s$. For this estimate, $\Phi^j(s)$ (\ref{eq:demo_pot}) first measures the number of steps needed to reach the $j$-th demonstration $\tau^j$ via $\tilde{V}_g$, and then follow it to the goal via $V_d^j$ (\ref{eq:demo_reward}). (Green) We use the overall potential to synthesize dense dynamics-aware rewards for the target task.
  • Figure 2: (a) Push slot environment in Robosuite. (b) Potential heatmap for the push slot with two reset grids.
  • Figure 3: Training progression in push slot tasks. We present the median (bold) and quartiles (shaded area) of success rates. Our approach leads to faster learning and more stable performance across tasks.
  • Figure 4: Potential heatmap for the push slot with a single reset grid. The agent constructs the term $V_d^{j}(s_t^j) + \tilde{V}_g(s;s_t^j)$ in \ref{eq:demo_pot} and \ref{eq:sinle_pot} from a fixed state $s$ (green) and different demonstrated states (purple to yellow). The state chosen by the maximizations in \ref{eq:sinle_pot} when constructing $\Phi$ is marked in pink.
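
The selection step described in the Figure 4 caption can be illustrated with a short sketch. It is a toy example under stated assumptions rather than the paper's code: `goal_value` is a hypothetical stand-in for $\tilde{V}_g$, the demonstration value is a hypothetical placeholder, a single demonstration is used, and every demonstrated state is treated as a member of $\Delta(s)$ for brevity.

```python
import math
from typing import List, Tuple

State = Tuple[float, float]
GAMMA = 0.95  # assumed discount factor


def goal_value(s: State, g: State) -> float:
    """Hypothetical stand-in for V~_g(s; g): decays with Euclidean distance."""
    return GAMMA ** math.dist(s, g)


def candidate_terms(s: State, demo: List[State]) -> List[float]:
    """V_d(s_t) + V~_g(s; s_t) for each demonstrated state s_t.

    V_d is a hypothetical placeholder: GAMMA ** (steps left in the demo).
    """
    last = len(demo) - 1
    return [GAMMA ** (last - t) + goal_value(s, s_t)
            for t, s_t in enumerate(demo)]


def selected_index(s: State, demo: List[State]) -> int:
    """Index of the demonstrated state picked by the maximization."""
    terms = candidate_terms(s, demo)
    return max(range(len(terms)), key=terms.__getitem__)
```

Plotting `candidate_terms` for one fixed state gives the per-state colouring the caption describes, and `selected_index` corresponds to the highlighted state.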

Theorems & Definitions (2)

  • Proposition 1
  • proof