PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning

Simon Holk, Daniel Marta, Iolanda Leite

TL;DR

PREDILECT addresses sample inefficiency and causal confusion in preference-based RL by using the zero-shot reasoning of a large language model (LLM) to extract sentiment and feature highlights from human prompts. By mapping prompts to state-action highlights and regularizing the reward model with these highlights, the method converges faster with fewer queries in both simulated and real human feedback settings. The approach shows that natural language explanations can tailor robot policies to specific user objectives, particularly in social navigation, while maintaining competitive performance. Overall, PREDILECT offers a practical route to more efficient and human-aligned reward learning in robotics through LLM-guided reasoning and highlight-based regularization.

Abstract

Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences on different sequences of state-action pairs. However, formulating realistic policies for robots demands responses from humans to an extensive array of queries. In this work, we approach the sample-efficiency challenge by expanding the information collected per query to contain both preferences and optional text prompting. To accomplish this, we leverage the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to contain flexible highlights -- state-action pairs that contain relatively high information and are related to the features processed in a zero-shot fashion from a pretrained LLM. In both a simulated scenario and a user study, we demonstrate the effectiveness of our work by analyzing the feedback and its implications. Additionally, the collected feedback is used to train a robot on socially compliant trajectories in a simulated social navigation landscape. We provide video examples of the trained policies at https://sites.google.com/view/rl-predilect
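
As a rough illustration of the reward-learning idea described above, the following Python sketch combines a Bradley-Terry preference loss with a highlight regularizer. The Bradley-Terry form is a common choice in preference-based RL and is assumed here, since this summary does not give the equations. The function names, the weight `lam`, and the exact form of the penalty are hypothetical and are not taken from the paper's objective.

```python
import math


def preference_loss(rewards_a, rewards_b, prefer_a):
    """Cross-entropy loss under a Bradley-Terry model (assumed form).

    rewards_a / rewards_b are the per-step predicted rewards of two segments.
    """
    ra, rb = sum(rewards_a), sum(rewards_b)
    # Numerically stable log-sum-exp over the two segment returns.
    m = max(ra, rb)
    log_z = m + math.log(math.exp(ra - m) + math.exp(rb - m))
    log_p_preferred = (ra if prefer_a else rb) - log_z
    return -log_p_preferred


def highlight_penalty(rewards, positive, negative):
    """Hypothetical regularizer: lower when positively highlighted steps get
    high predicted reward and negatively highlighted steps get low reward.

    Highlights are (start, end) index pairs, end exclusive.
    """
    pos = sum(rewards[t] for start, end in positive for t in range(start, end))
    neg = sum(rewards[t] for start, end in negative for t in range(start, end))
    return neg - pos


def total_loss(rewards_a, rewards_b, prefer_a, positive, negative, lam=0.1):
    """Preference loss plus a weighted highlight term on the preferred segment."""
    preferred = rewards_a if prefer_a else rewards_b
    return (preference_loss(rewards_a, rewards_b, prefer_a)
            + lam * highlight_penalty(preferred, positive, negative))
```

In this sketch the highlight term is applied only to the preferred segment, which mirrors the description of highlights being extracted from the preferred segment; a real implementation would compute these quantities on a differentiable reward model rather than on plain lists.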

Paper Structure

This paper contains 23 sections, 8 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: An overview of PREDILECT in a social navigation scenario: Initially, a human is shown two trajectories, A and B. They signal their preference for one of the trajectories and provide an additional text prompt to elaborate on their insights. Subsequently, an LLM can be employed for extracting feature sentiment, revealing the causal reasoning embedded in their text prompt, which is processed and mapped to a set of intrinsic values. Finally, both the preferences and the highlighted insights are utilized to more accurately define a reward function.
  • Figure 2: Representation of highlights within a segment. The segment $\sigma$, outlined by a curve, contains multiple highlights: two negative highlights in $\mathcal{H}_{\mathcal{F}}^-$, depicted in red, and one positive highlight $h^+$ in $\mathcal{H}_{\mathcal{F}}^+$, depicted in green. All highlights are of the same length $L$.
  • Figure 3: An overview of how PREDILECT processes prompts from humans. Initially, a human provides a prompt, depicted in green, along with a set of intrinsic features $\mathcal{F}$, depicted in purple, which is environment dependent. Both are input into the LLM (ChatGPT-4 in the case of PREDILECT) to generate a response $\text{r}_i$. Subsequently, after mapping a segment $\sigma$ to a tensor of metrics $\mathcal{T}$ using the mapping function $M$, we apply a searching function $g$ to obtain the set $\mathcal{H}_\mathcal{F}$ of highlights for each feature. These highlights are then used to train our reward model $\hat{r}_{\psi}$ as per Eq. \ref{eq:predilect}.
  • Figure 4: Framework representation of PREDILECT. Step A: We train policy $\pi_\omega$ and sample rollouts, which are stored in $\mathcal{D}_{\sigma}$. Step B: We sample trajectory segments $\sigma$ to query humans and collect both preferences and prompts. The prompts are processed through an LLM to obtain responses. Those responses are used to obtain highlights $(\mathcal{H}_\mathcal{F}^+, \mathcal{H}_\mathcal{F}^-)$ from the preferred segment $\sigma^*$. Step C: The sentiment-highlighted queries are collected to form dataset $\mathcal{D}_{shq}$ and update the current reward model $\hat{r}_{\psi}$.
  • Figure 5: Learning curves: for Reacher (a), PREDILECT used 200 queries, Baseline used 400 queries; for Cheetah (b), PREDILECT used 200 queries, Baseline used 400 queries.
  • ...and 2 more figures
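
The mapping and search steps named in the Figure 3 caption can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the captions do not specify how $M$ or $g$ are implemented, so `map_segment` (standing in for $M$), `find_highlight` (standing in for $g$), the sliding-window criterion, and the `distance_to_human` feature are all hypothetical choices made for illustration.

```python
def map_segment(segment, features):
    """Stand-in for M: map a segment (list of steps) to per-feature metric series."""
    return {name: [fn(step) for step in segment] for name, fn in features.items()}


def find_highlight(series, length, pick="max"):
    """Stand-in for g: return (start, end) of the fixed-length window whose
    summed metric is largest (pick="max") or smallest (pick="min").

    `end` is exclusive; ties keep the earliest window.
    """
    if length <= 0 or length > len(series):
        raise ValueError("length must be between 1 and len(series)")
    window = sum(series[:length])
    best_start, best_value = 0, window
    for start in range(1, len(series) - length + 1):
        # Slide the window one step: add the new element, drop the old one.
        window += series[start + length - 1] - series[start - 1]
        better = window > best_value if pick == "max" else window < best_value
        if better:
            best_start, best_value = start, window
    return best_start, best_start + length


# Hypothetical example: each step records the robot's distance to a human.
segment = [{"dist": d} for d in [5.0, 4.0, 1.0, 0.5, 3.0, 7.0]]
features = {"distance_to_human": lambda step: step["dist"]}

metrics = map_segment(segment, features)
# If the LLM response flagged this feature negatively, one plausible choice is
# to highlight the window where the robot was closest to the human.
closest = find_highlight(metrics["distance_to_human"], length=2, pick="min")
```

Using a fixed window length matches the Figure 2 caption, which states that all highlights share the same length $L$; whether the paper selects windows by an extreme-value criterion like this one is not stated here.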