Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning

Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li

TL;DR

The paper addresses the challenge of learning from offline human feedback when reward design is difficult by proposing Search-Based Preference Weighting (SPW), which computes transition-level weights by measuring similarity to expert demonstrations and integrates these weights into preference-based reward learning. SPW unifies demonstrations and preferences in a single-stage offline framework, enabling fine-grained credit assignment without online interaction or extra loss terms. Empirical results on Meta-World and DMControl show that SPW outperforms pure PbRL and hybrid baselines even with minimal expert data and a modest number of preference labels, and analysis reveals improved reward distributions and targeted credit signals that align with ground-truth rewards. The work demonstrates robust performance enhancements, practical sample efficiency, and strong potential for real-human feedback scenarios in robotic manipulation tasks.

Abstract

Offline reinforcement learning learns policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to the preference for a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
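The search step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the single nearest neighbor, the softmax normalization over the segment, and the names `nearest_expert_distance` and `spw_weights` are all assumptions chosen for illustration. The temperature `tau` mirrors the hyperparameter mentioned in the Figure 5 caption, but how the paper actually applies it is not stated here.

```python
import math


def nearest_expert_distance(query, expert_pairs):
    """Smallest Euclidean distance from one state-action vector to any expert vector.

    Assumption: Euclidean distance on concatenated (state, action) vectors.
    """
    return min(math.dist(query, expert) for expert in expert_pairs)


def spw_weights(segment, expert_pairs, tau=1.0):
    """Hypothetical stepwise importance weights for one trajectory segment.

    Each transition is scored by its distance to the closest expert pair,
    then scores are turned into weights with a temperature-scaled softmax
    (an assumed normalization). Smaller distance gives a larger weight.
    """
    distances = [nearest_expert_distance(x, expert_pairs) for x in segment]
    logits = [-d / tau for d in distances]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data: three transitions, two expert state-action pairs.
segment = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
expert_pairs = [(0.0, 0.0), (1.0, 1.0)]
weights = spw_weights(segment, expert_pairs, tau=1.0)
```

In this sketch, lowering `tau` concentrates the weight on the transitions closest to expert behavior, while raising it moves the weights toward uniform.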

Paper Structure

This paper contains 27 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An overview of SPW. (a) Weight extraction: Importance weights are computed for each transition in the preference-labeled trajectories based on their similarity to expert demonstrations. Darker colors indicate higher weights. (b) Weighted reward optimization: The return of a preference trajectory is modeled as the weighted sum of stepwise rewards. This formulation is then integrated into the standard preference learning framework.
  • Figure 2: Normalized reward curves of MR, PT, and GT for trajectory segments in the box-close task. Snapshots from selected positions along the segment are shown for visual reference.
  • Figure 3: Normalized reward profiles of MR, SPW, and GT within a trajectory segment in the box-close task. Snapshots from selected positions along the segment are shown for visual reference.
  • Figure 4: Comparison of the reward distributions learned by MR and SPW with the ground-truth (GT) reward in the peg-unplug-side task. The plot annotates the KL divergence between each learned distribution (SPW and MR) and the GT distribution.
  • Figure 5: Average success rates of SPW when adjusting the temperature $\tau$. We use a total of 200 preference labels.
  • ...and 1 more figures
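
The weighted reward optimization described in the Figure 1(b) caption can be sketched as follows. This is an illustrative assumption rather than the paper's code: it takes the "standard preference learning framework" to be the Bradley-Terry cross-entropy loss commonly used in PbRL, and the function names `weighted_return` and `preference_loss` are hypothetical. The per-step rewards would normally come from a learned reward model; here they are plain lists.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_return(rewards, weights):
    """Segment return modeled as the weighted sum of stepwise rewards."""
    return sum(w * r for w, r in zip(weights, rewards))


def preference_loss(rewards_0, weights_0, rewards_1, weights_1, label):
    """Bradley-Terry cross-entropy on weighted segment returns (assumed form).

    label = 1 means segment 1 is preferred, label = 0 means segment 0 is
    preferred; a soft label such as 0.5 can encode an equal preference.
    """
    diff = weighted_return(rewards_1, weights_1) - weighted_return(rewards_0, weights_0)
    log_p1 = -softplus(-diff)  # log P(segment 1 preferred)
    log_p0 = -softplus(diff)   # log P(segment 0 preferred)
    return -(label * log_p1 + (1.0 - label) * log_p0)


# Toy example: segment 1 has higher weighted return and is labeled preferred.
loss = preference_loss([0.0, 0.0], [0.5, 0.5], [2.0, 2.0], [0.5, 0.5], label=1)
```

Because the weights only rescale each step's contribution to the segment return, this formulation needs no additional loss term, consistent with the single-stage framing in the TL;DR.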