Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning
Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li
TL;DR
The paper proposes Search-Based Preference Weighting (SPW) for learning from offline human feedback when reward functions are difficult to design. SPW computes transition-level weights by measuring each transition's similarity to expert demonstrations and integrates these weights into preference-based reward learning. It thereby unifies demonstrations and preferences in a single-stage offline framework, enabling fine-grained credit assignment without online interaction or extra loss terms. Empirical results on Meta-World and DMControl show that SPW outperforms pure preference-based RL (PbRL) and hybrid baselines, even with minimal expert data and a modest number of preference labels. Analysis indicates improved reward distributions and targeted credit signals that align with ground-truth rewards. The results suggest robust performance gains, practical sample efficiency, and potential for use with real human feedback in robotic manipulation tasks.
Abstract
Offline reinforcement learning learns policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to the preference assigned to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs in the expert demonstrations and directly derives stepwise importance weights from their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment than traditional approaches achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
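The search-and-weight idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, how scores become weights, or how the weights enter the loss. The sketch assumes Euclidean nearest-neighbor distance, a softmax over negative distances with a hypothetical `temperature` parameter, and a Bradley-Terry preference loss on weighted segment returns. The function names `search_weights` and `weighted_preference_loss` are likewise illustrative.

```python
import numpy as np


def search_weights(segment, expert, temperature=1.0):
    """Stepwise weights for one segment (illustrative, not the paper's exact rule).

    segment: (T, d) array of concatenated state-action vectors.
    expert:  (N, d) array of expert state-action vectors.
    Returns a (T,) array of non-negative weights that sum to 1.
    """
    # Pairwise Euclidean distances between segment and expert pairs: shape (T, N).
    diffs = segment[:, None, :] - expert[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # "Search": distance from each transition to its most similar expert pair.
    nearest = dists.min(axis=1)
    # Assumption: closer to expert data means higher weight, via a softmax.
    logits = -nearest / temperature
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def weighted_preference_loss(r0, r1, w0, w1, label):
    """Bradley-Terry cross-entropy on weighted segment returns.

    r0, r1: (T,) predicted per-step rewards for segments 0 and 1.
    w0, w1: (T,) stepwise weights for each segment.
    label:  1.0 if segment 1 is preferred, 0.0 if segment 0 is preferred.
    """
    ret0 = (w0 * r0).sum()
    ret1 = (w1 * r1).sum()
    # log-softmax over the two weighted returns.
    log_norm = np.logaddexp(ret0, ret1)
    log_p0 = ret0 - log_norm
    log_p1 = ret1 - log_norm
    return -(label * log_p1 + (1.0 - label) * log_p0)
```

Under these assumptions, a transition that lies close to some expert state-action pair receives a larger share of the segment's weight, so its predicted reward has more influence on the preference likelihood. Setting all weights equal recovers an unweighted Bradley-Terry objective up to a scale factor on the returns.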
