Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations

Yujie Zhu, Charles A. Hepburn, Matthew Thorpe, Giovanni Montana

TL;DR

This work introduces SPReD, a framework for uncertainty-aware, smooth policy regularisation from demonstrations in reinforcement learning with sparse rewards. By modelling two Q-value distributions from an ensemble of critics (one for demonstration actions, one for current policy actions), SPReD replaces binary imitation decisions with continuous weights that scale the behaviour cloning loss. It presents two weighting schemes: SPReD-P (probabilistic likelihood of demonstration superiority) and SPReD-E (exponential advantage weighting), both offering gradient-variance reduction and adaptive imitation strength. Theoretical analysis proves variance reduction and adaptive behaviour under uncertainty, while experiments across eight robotics tasks show gains of up to 14× in complex manipulation and robustness to demonstration quality and quantity. Overall, SPReD delivers state-of-the-art sample efficiency with modest computational overhead, suggesting practical impact for learning from limited demonstrations in real-world robotics.

Abstract

In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
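To make the contrast between binary and continuous imitation weights concrete, the sketch below computes three weights from two sets of ensemble Q-estimates. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian approximation of each ensemble, the standardised-advantage form of the exponential weight, and all function names (`prob_weight`, `exp_advantage_weight`, `q_filter_weight`, `weighted_bc_loss`) are hypothetical choices made here. The released code at the URL above is the reference for the actual formulas.

```python
import math
import statistics


def _standardised_gap(q_demo, q_policy, eps=1e-8):
    """Mean Q-gap (demo minus policy) divided by its combined ensemble spread.

    Assumption: each ensemble is summarised by its sample mean and sample
    standard deviation, and the two estimates are treated as independent.
    """
    mu_d, sd_d = statistics.fmean(q_demo), statistics.stdev(q_demo)
    mu_p, sd_p = statistics.fmean(q_policy), statistics.stdev(q_policy)
    return (mu_d - mu_p) / math.sqrt(sd_d ** 2 + sd_p ** 2 + eps)


def q_filter_weight(q_demo, q_policy):
    """Binary decision: imitate only if the mean demo Q-value is higher."""
    return 1.0 if statistics.fmean(q_demo) > statistics.fmean(q_policy) else 0.0


def prob_weight(q_demo, q_policy):
    """Probability that the demo action is better, under a Gaussian assumption."""
    z = _standardised_gap(q_demo, q_policy)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def exp_advantage_weight(q_demo, q_policy):
    """Exponential of the standardised advantage, capped at 1 (hypothetical form)."""
    z = _standardised_gap(q_demo, q_policy)
    return min(1.0, math.exp(z))


def weighted_bc_loss(policy_action, demo_action, weight):
    """Behaviour cloning loss ||pi(s) - a||^2 scaled by the imitation weight."""
    sq_err = sum((p - a) ** 2 for p, a in zip(policy_action, demo_action))
    return weight * sq_err


if __name__ == "__main__":
    # Hypothetical Q-estimates from a 5-member critic ensemble.
    q_demo = [1.0, 1.2, 0.8, 1.1, 0.9]
    q_policy = [0.9, 1.3, 0.6, 1.0, 1.0]
    for name, fn in [("q-filter", q_filter_weight),
                     ("probabilistic", prob_weight),
                     ("exp-advantage", exp_advantage_weight)]:
        w = fn(q_demo, q_policy)
        print(name, w, weighted_bc_loss([0.2, -0.1], [0.5, 0.0], w))
```

When the two ensembles overlap heavily, the binary filter still jumps between 0 and 1 on a small change in the means, whereas the continuous weights move gradually. This is the behaviour the abstract describes as uncertainty-proportional regularisation.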

Paper Structure

This paper contains 52 sections, 1 theorem, 37 equations, 12 figures, 5 tables, 1 algorithm.

Key Result

Lemma 5.1

Assuming (A1) gradient norms are bounded and (A2) demonstrations are independently sampled, let $X_k=\mathds{1}_k\,g_k$, $Y_k=p_k\,g_k$, and $g_k=\nabla_\phi\|\pi_\phi(s_k)-a_k\|^2$, where $\mathds{1}_k$ represents binary filtering decisions, $p_k \in [0,1]$ represents our continuous weights, and $N_D$ is the batch size. Then the variance of the batch-averaged gradient formed from the continuously weighted terms $Y_k$ is no larger than that of the average formed from the binary-filtered terms $X_k$. Strict inequality holds if $\mathbb{P}(0<p_k<1)>0$.
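One way a gap of this kind can arise is sketched below. The derivation rests on an extra assumption that is not stated in the lemma as excerpted here: conditional on $(g_k, p_k)$, the binary decision $\mathds{1}_k$ is modelled as a Bernoulli draw with mean $p_k$, independent across $k$. Under that assumption the law of total variance gives a non-negative gap. The paper's own statement and proof may differ in form; for vector-valued gradients, $\mathrm{Var}$ denotes the trace of the covariance.

```latex
% Assumption (not from the source): 1_k | (g_k, p_k) ~ Bernoulli(p_k),
% so that E[X_k | g_k, p_k] = p_k g_k = Y_k.
\begin{align*}
\mathrm{Var}(X_k)
  &= \mathbb{E}\big[\mathrm{Var}(X_k \mid g_k, p_k)\big]
   + \mathrm{Var}\big(\mathbb{E}[X_k \mid g_k, p_k]\big) \\
  &= \mathbb{E}\big[p_k(1-p_k)\,\|g_k\|^2\big] + \mathrm{Var}(Y_k), \\[4pt]
\mathrm{Var}\Big(\tfrac{1}{N_D}\sum_{k=1}^{N_D} X_k\Big)
 - \mathrm{Var}\Big(\tfrac{1}{N_D}\sum_{k=1}^{N_D} Y_k\Big)
  &= \frac{1}{N_D^{2}} \sum_{k=1}^{N_D}
     \mathbb{E}\big[p_k(1-p_k)\,\|g_k\|^2\big] \;\ge\; 0 .
\end{align*}
```

The second line uses independence across samples (A2) to sum per-sample variances, and bounded gradient norms (A1) keep every expectation finite. In this sketch the gap is strictly positive whenever $p_k$ lies strictly between 0 and 1 with positive probability on samples where $g_k \neq 0$, which matches the strictness condition in the lemma.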

Figures (12)

  • Figure 1: Performance comparison across eight robotics tasks. Solid lines represent mean success rates across 5 seeds, with shaded areas showing standard deviation. The learning curves are smoothed using a 5-point moving average. Horizontal dashed lines indicate the success rates of the demonstrations used for training. Our SPReD methods (red and brown) consistently outperform baselines across environments of varying complexity.
  • Figure 2: Effect of demonstration quality in FetchPickAndPlace. The demonstrations are expert, suboptimal and severely suboptimal from left to right with success rates shown as dashed lines.
  • Figure 3: Effect of demonstration size in FetchPickAndPlace. The demonstrations collected from the same policy contain 5, 10, 20, or 50 episodes with success rates shown as the dashed lines.
  • Figure 4: Computational cost for the individual experimental runs for FetchPickAndPlace with 4e6 steps.
  • Figure 5: Performance comparison across five locomotion tasks. Horizontal dashed lines indicate the scores of the demonstrations used for training. Our SPReD methods (red and brown) consistently outperform baselines across different tasks.
  • ...and 7 more figures

Theorems & Definitions (8)

  • Lemma 5.1: Gradient‐variance gap
  • proof
  • proof
  • Remark
  • Remark
  • proof
  • Remark
  • Remark