Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings

Seyed Mahdi Basiri Azad, Joschka Boedecker

TL;DR

Sparse rewards impede effective value propagation in Q-learning. The paper proposes a simple offline-to-online bootstrapping method that uses a handful of successful demonstrations to compute $V(s)$ via $G_t=\gamma^{T-t}$ (with terminal reward $r_T=1$) and initializes $Q(s,a)$ for demonstrated pairs as $Q(s_t,a_t)\leftarrow V(s_t)$, followed by online TD updates. This approach, which works even with a single or sub-optimal demonstration, is extended to continuous domains through a categorical value representation and separate offline/online replay, supported by regret-based analysis and empirical results showing faster convergence on tabular and continuous tasks under sparse rewards. The key contributions include a simple and robust demonstration-based bootstrapping technique, an extension to function approximation with stability via discretized value learning, and comprehensive ablations on demonstration count and quality. Overall, the method improves sample efficiency and exploration efficiency in sparse-reward RL without requiring dense rewards or large offline datasets, enabling more practical deployments in real-world tasks.
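The tabular version of this initialization can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The function names `init_q_from_demos` and `td_update`, the dictionary-based Q-table, and the toy demonstration are all hypothetical. Two conventions are assumed here that the summary does not specify: $T$ is taken as the index of the last demonstrated state-action pair (so that pair receives value $\gamma^{0}=1$), and when several demonstrations cover the same pair the larger value is kept.

```python
from collections import defaultdict


def init_q_from_demos(demos, gamma=0.9):
    """Build a Q-table from successful demonstrations.

    demos: list of trajectories, each a list of (state, action) pairs
           that ends with the action earning the terminal reward r_T = 1.
    Assumption: T is the index of the last pair, so G_t = gamma ** (T - t).
    """
    q = defaultdict(float)  # pairs not seen in any demo start at 0
    for traj in demos:
        T = len(traj) - 1
        for t, (s, a) in enumerate(traj):
            g = gamma ** (T - t)            # discounted return-to-go, V(s_t)
            q[(s, a)] = max(q[(s, a)], g)   # assumption: keep the best estimate
    return q


def td_update(q, s, a, r, s_next, done, actions, alpha=0.5, gamma=0.9):
    """One standard online Q-learning step applied to the initialized table."""
    if done:
        target = r
    else:
        target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


# Hypothetical 3-step demonstration on a small grid.
ACTIONS = ["up", "down", "left", "right"]
demo = [((0, 0), "right"), ((0, 1), "right"), ((0, 2), "down")]
q_table = init_q_from_demos([demo], gamma=0.9)
```

Under these assumptions the initialized values already satisfy the TD fixed point along the demonstrated path, so early online updates leave them unchanged while spreading value to neighboring, undemonstrated pairs.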

Abstract

Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.

Paper Structure

This paper contains 26 sections, 20 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: 3x3 grid with one demonstration
  • Figure 2: On-policy state visitation for the Easy (top), Medium (middle), and Hard (bottom) versions of the environment. The left column shows the environments, with ice holes represented in blue. The remaining columns show the on-policy state visitation probability implied by the policies from Converged Q-Learning (second column), Demo-initialized Q-Learning (third column), and zero-initialized Q-Learning (fourth column).
  • Figure 3: SODA can accelerate learning, especially in sparse-reward settings.
  • Figure 4: SODA can learn with as few as 1 demonstration.
  • Figure 5: SODA can learn to solve the environments using sub-optimal demonstrations.