PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning
Simon Holk, Daniel Marta, Iolanda Leite
TL;DR
PREDILECT addresses sample inefficiency and causal confusion in preference-based RL by using the zero-shot reasoning of an LLM to extract sentiment and feature highlights from optional human text prompts. The method maps each prompt to state-action highlights and regularizes the reward model with them, which leads to faster convergence with fewer queries under both simulated and real human feedback. The results indicate that natural language explanations can tailor robot policies to specific user objectives, particularly in social navigation, while maintaining competitive performance. Overall, PREDILECT offers a practical route to more efficient, human-aligned reward learning in robotics through LLM-guided reasoning and highlight-based regularization.
Abstract
Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences over different sequences of state-action pairs. However, formulating realistic policies for robots requires humans to answer an extensive array of queries. In this work, we approach the sample-efficiency challenge by expanding the information collected per query to contain both preferences and optional text prompting. To accomplish this, we leverage the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to contain flexible highlights: state-action pairs that carry relatively high information and are related to the features extracted in a zero-shot fashion by a pretrained LLM. In both a simulated scenario and a user study, we demonstrate the effectiveness of our work by analyzing the feedback and its implications. Additionally, the collected feedback is used to train a robot on socially compliant trajectories in a simulated social navigation landscape. We provide video examples of the trained policies at https://sites.google.com/view/rl-predilect
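
To make the idea of a preference objective augmented with highlights more concrete, the following is a minimal sketch and not the formulation used in the paper, which the abstract does not spell out. It assumes the standard Bradley-Terry preference model over summed segment rewards, plus a hypothetical hinge-style regularizer that pushes the predicted reward of a highlighted step above (positive sentiment) or below (negative sentiment) the segment mean. All function names, the `margin` and `weight` parameters, and the form of the regularizer are assumptions made for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(rewards_a, rewards_b, prefer_a):
    """Bradley-Terry cross-entropy for one pairwise query.

    rewards_a and rewards_b hold the reward model's per-step predictions
    for the two segments shown to the human.
    """
    logit = sum(rewards_a) - sum(rewards_b)  # P(a preferred) = sigmoid(logit)
    return -log_sigmoid(logit if prefer_a else -logit)


def highlight_penalty(rewards, highlights, margin=0.1):
    """ASSUMED regularizer, not taken from the paper.

    highlights maps a step index to a sentiment of +1 or -1, standing in
    for what an LLM might extract from the human's text prompt. A step is
    penalized unless its predicted reward differs from the segment mean
    by at least `margin` in the direction of the sentiment.
    """
    mean_r = sum(rewards) / len(rewards)
    return sum(
        max(0.0, margin - sentiment * (rewards[t] - mean_r))
        for t, sentiment in highlights.items()
    )


def total_loss(rewards_a, rewards_b, prefer_a,
               highlights_a, highlights_b, weight=0.5):
    """Preference term plus the weighted (hypothetical) highlight term."""
    reg = (highlight_penalty(rewards_a, highlights_a)
           + highlight_penalty(rewards_b, highlights_b))
    return preference_loss(rewards_a, rewards_b, prefer_a) + weight * reg
```

In this sketch, a query without any text prompt passes empty highlight dictionaries and reduces to the plain preference loss, which mirrors the abstract's description of text prompting as optional.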
