Reward Redistribution via Gaussian Process Likelihood Estimation

Minheng Xiao, Xian Yu

TL;DR

This work tackles delayed and sparse rewards in reinforcement learning by modeling the per-step reward function as a sample from a Gaussian process (GP) and fitting it with a leave-one-out likelihood over each trajectory. It shows that mean-squared-error (MSE) based reward redistribution is a degenerate case of GP-LRR, obtained when the kernel matrix is the identity and the observation noise vanishes; in the general case, the GP precision matrix pools gradients across correlated state–action pairs. Integrated with Soft Actor-Critic (SAC), the method yields dense, uncertainty-aware rewards that improve sample efficiency and final performance on MuJoCo benchmarks. Empirically, GP-LRR with SAC improves credit assignment on long-horizon tasks with delayed feedback, offering a principled approach to reward redistribution for off-policy RL.

Abstract

In many practical reinforcement learning tasks, feedback is only provided at the end of a long horizon, leading to sparse and delayed rewards. Existing reward redistribution methods typically assume that per-step rewards are independent, thus overlooking interdependencies among state-action pairs. In this paper, we propose a Gaussian process based Likelihood Reward Redistribution (GP-LRR) framework that addresses this issue by modeling the reward function as a sample from a Gaussian process, which explicitly captures dependencies between state-action pairs through the kernel function. By maximizing the likelihood of the observed episodic return via a leave-one-out strategy that leverages the entire trajectory, our framework inherently introduces uncertainty regularization. Moreover, we show that conventional mean-squared-error (MSE) based reward redistribution arises as a special case of our GP-LRR framework when using a degenerate kernel without observation noise. When integrated with an off-policy algorithm such as Soft Actor-Critic, GP-LRR yields dense and informative reward signals, resulting in superior sample efficiency and policy performance on several MuJoCo benchmarks.
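
To illustrate how a kernel can encode dependencies between state-action pairs, here is a minimal sketch of the standard RBF (squared-exponential) kernel matrix over a trajectory. It is not the paper's implementation: the function name `rbf_kernel`, the random `features` standing in for state-action features, and the absence of a learned feature map or signal-variance scale are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, length_scale):
    """Standard RBF kernel: K[i, j] = exp(-||x_i - x_j||^2 / (2 * length_scale^2)).

    X is a (T, d) array with one hypothetical state-action feature row per step.
    """
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X**2, axis=1)
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * length_scale**2))

# Hypothetical features for a 4-step trajectory (placeholder data, not MuJoCo).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))

K_short = rbf_kernel(features, 0.5)  # short length scale: steps nearly independent
K_long = rbf_kernel(features, 5.0)   # long length scale: steps strongly correlated
```

A longer length scale pushes the off-diagonal entries toward 1, so more steps are treated as correlated; a shorter one pushes the matrix toward the identity.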

Paper Structure

This paper contains 27 sections, 4 theorems, 61 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Proposition 1

The traditional MSE-based reward redistribution approach emerges as a special case of our GP framework. Specifically, when the kernel matrix reduces to identity ($\mathbf{K}_{\boldsymbol\phi} = \mathbf{I}$) and observation noise vanishes ($\sigma_\epsilon = 0$), the objective function for eq:GP-Obj
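
The special case in Proposition 1 can be checked numerically under a simplified model. The sketch below is not the paper's objective (eq:GP-Obj is not reproduced on this page, and the leave-one-out construction is omitted). It assumes only that the per-step rewards are jointly Gaussian with predicted means `mu` and covariance `K + sigma_eps**2 * I`, so that the episodic return, being their sum, is Gaussian with mean `sum(mu)` and variance equal to the sum of all covariance entries. The function name `return_nll` and all variable names are hypothetical.

```python
import numpy as np

def return_nll(mu, K, sigma_eps, R):
    """Negative log-likelihood of episodic return R under the assumed model
    r ~ N(mu, K + sigma_eps^2 I), with R = sum of the per-step rewards."""
    T = len(mu)
    cov = K + sigma_eps**2 * np.eye(T)
    ones = np.ones(T)
    var = ones @ cov @ ones          # variance of the summed rewards
    resid = R - mu.sum()             # return prediction error
    return 0.5 * resid**2 / var + 0.5 * np.log(2.0 * np.pi * var)

# Hypothetical predicted per-step rewards and an observed episodic return.
rng = np.random.default_rng(0)
T = 5
mu = rng.normal(size=T)
R = 3.0

# Identity kernel, zero noise: the variance is the constant T.
nll_identity = return_nll(mu, np.eye(T), 0.0, R)

# Squared return error, as in an MSE-style redistribution loss.
squared_error = (R - mu.sum()) ** 2
constant = 0.5 * np.log(2.0 * np.pi * T)
```

In this simplified setting `nll_identity` equals `squared_error / (2 * T) + constant`, so minimizing the likelihood objective over `mu` is equivalent to minimizing the squared return error. With a non-identity kernel the variance term depends on the kernel parameters, which is where the uncertainty-related terms would enter.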

Figures (4)

  • Figure 1: (Left) HalfCheetah-v4 environment visualization. (Right) Lag-1 autocorrelation in rewards collected under random policy, where each point represents a $(r_t, r_{t+1})$ pair from trajectory rollouts.
  • Figure 2: Learning curves on different MuJoCo environments. Solid curves show average returns over 5 independent runs; shaded regions indicate standard deviation. Performance is evaluated every 5,000 environment steps. GP-LRR uses the RBF kernel in these experiments.
  • Figure 3: Learning curves on different MuJoCo environments with sparse episodic rewards using GP-LRR method with various kernels. Solid curves show average returns over 5 independent runs; shaded regions indicate standard deviation. Performance is evaluated every 5,000 environment steps.
  • Figure 4: Learning curves on different MuJoCo environments with sparse episodic rewards using GP-LRR method with different $\ell_{\text{rbf}}$ initialization. Solid curves show average returns over 5 independent runs; shaded regions indicate standard deviation. Performance is evaluated every 5,000 environment steps.
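
The lag-1 autocorrelation shown in Figure 1 can be computed as the Pearson correlation over consecutive $(r_t, r_{t+1})$ pairs. The sketch below assumes that definition and uses synthetic sequences in place of HalfCheetah-v4 rollouts; the function name `lag1_autocorrelation` and the AR(1) test signal are hypothetical and only serve to show the computation.

```python
import numpy as np

def lag1_autocorrelation(rewards):
    """Pearson correlation between r_t and r_{t+1} over one reward sequence."""
    r = np.asarray(rewards, dtype=float)
    current, following = r[:-1], r[1:]
    return np.corrcoef(current, following)[0, 1]

# Synthetic stand-ins for reward sequences (not MuJoCo data).
rng = np.random.default_rng(0)
noise = rng.normal(size=1000)        # independent rewards

smooth = np.empty(1000)              # AR(1) sequence: correlated rewards
smooth[0] = noise[0]
for t in range(1, 1000):
    smooth[t] = 0.9 * smooth[t - 1] + noise[t]

rho_noise = lag1_autocorrelation(noise)
rho_smooth = lag1_autocorrelation(smooth)
```

A value of `rho_smooth` well above `rho_noise` is the kind of signal that would argue against treating per-step rewards as independent.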

Theorems & Definitions (12)

  • Proposition 1: MSE as a Special Case
  • Remark 1
  • Proposition 2: Gradient Flow with Correlations
  • Remark 2: Credit Assignment through Correlations
  • Proposition 3: Length Scale and Smoothness Trade-off
  • Remark 3: Automatic Smoothness Adaptation
  • Proposition 4: Observation Noise Adaptation
  • Remark 4: Automatic Noise Level Selection
  • proof
  • proof
  • ...and 2 more