Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
Seyed Mahdi Basiri Azad, Joschka Boedecker
TL;DR
Sparse rewards impede effective value propagation in Q-learning. The paper proposes a simple offline-to-online bootstrapping method that uses a handful of successful demonstrations. With terminal reward $r_T=1$ and no intermediate reward, each demonstrated state receives the value $V(s_t)=G_t=\gamma^{T-t}$, demonstrated pairs are initialized as $Q(s_t,a_t)\leftarrow V(s_t)$, and online TD updates then refine these estimates. The approach works even with a single or suboptimal demonstration and is extended to continuous domains through a categorical value representation and separate offline/online replay. It is supported by regret-based analysis and by empirical results showing faster convergence on tabular and continuous tasks under sparse rewards. The key contributions are a simple and robust demonstration-based bootstrapping technique, an extension to function approximation that is stabilized by discretized value learning, and comprehensive ablations on demonstration count and quality. Overall, the method improves sample and exploration efficiency in sparse-reward RL without requiring dense rewards or large offline datasets, which makes it more practical for real-world tasks.
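The tabular initialization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `demo_returns` and `init_q_from_demos`, the representation of a demonstration as a list of `(state, action)` pairs ending in the rewarded step, and the rule of keeping the larger value when two demonstrations share a pair are all hypothetical choices made for the example.

```python
from collections import defaultdict


def demo_returns(num_steps, gamma):
    """Return [G_0, ..., G_T] with G_t = gamma**(T - t), where T = num_steps - 1.

    Assumes the only non-zero reward is r_T = 1 at the last demonstrated step.
    """
    T = num_steps - 1
    return [gamma ** (T - t) for t in range(num_steps)]


def init_q_from_demos(demos, gamma):
    """Build a Q-table whose demonstrated pairs are set to Q(s_t, a_t) = G_t.

    demos: list of demonstrations, each a list of (state, action) pairs.
    Pairs that never appear in a demonstration keep the default value 0.0.
    """
    q = defaultdict(float)
    for demo in demos:
        for (state, action), g in zip(demo, demo_returns(len(demo), gamma)):
            # Assumption for this sketch: if several demonstrations visit the
            # same pair, keep the largest return.
            q[(state, action)] = max(q[(state, action)], g)
    return q


# Hypothetical three-step demonstration on a chain s0 -> s1 -> s2 -> goal.
demo = [("s0", "right"), ("s1", "right"), ("s2", "right")]
q_table = init_q_from_demos([demo], gamma=0.5)
```

With `gamma=0.5`, the demonstrated pairs receive the values 0.25, 0.5, and 1.0 from first to last step, so the value signal already reaches the start state before any online interaction.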
Abstract
Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.
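To illustrate how standard online interaction can refine precomputed values, the sketch below applies one tabular Q-learning update to a table that is assumed to have been initialized from a demonstration. It is a hedged example rather than the paper's code: the function `td_update`, the dictionary-based table, the two-action chain, and the values of `gamma` and `alpha` are all assumptions chosen so the arithmetic is easy to follow.

```python
def td_update(q, s, a, r, s_next, done, actions, gamma, alpha):
    """Apply one tabular Q-learning update in place and return the TD error.

    q is a dict keyed by (state, action); missing entries count as 0.0.
    """
    if done:
        bootstrap = 0.0  # no bootstrapping past a terminal transition
    else:
        bootstrap = max(q.get((s_next, b), 0.0) for b in actions)
    target = r + gamma * bootstrap
    td_error = target - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return td_error


# Assumed prior from one demonstration with gamma = 0.5: Q(s_t, a_t) = gamma**(T - t).
ACTIONS = ["left", "right"]
q = {("s0", "right"): 0.25, ("s1", "right"): 0.5, ("s2", "right"): 1.0}

# Replaying a demonstrated transition: the target 0.5 * 0.5 equals the stored value.
err_demo = td_update(q, "s0", "right", 0.0, "s1", False, ACTIONS, gamma=0.5, alpha=0.5)

# An undemonstrated action that also reaches s1 inherits value through bootstrapping.
err_new = td_update(q, "s0", "left", 0.0, "s1", False, ACTIONS, gamma=0.5, alpha=0.5)
```

In this toy setting the demonstrated transition has zero TD error, because $\gamma \cdot \gamma^{T-t-1} = \gamma^{T-t}$, while the unexplored action is pulled toward the value of the demonstrated successor state after a single update. This is one way to see why a prior on a few pairs can shorten exploration, though the chain and its parameters are purely illustrative.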
