ICPL: Few-shot In-context Preference Learning via LLMs

Chao Yu, Qixin Tan, Hong Lu, Jiaxuan Gao, Xinting Yang, Yu Wang, Yi Wu, Eugene Vinitsky

TL;DR

ICPL tackles reward specification in preference-based reinforcement learning by leveraging large language models to generate executable reward functions in-context and refine them through human preferences. It embeds an online loop in which an LLM proposes reward functions, reinforcement learning trains policies, and human or proxy feedback guides successive reward generation, assisted by automatic signals to enrich prompts. Across proxy and real-human experiments on IsaacGym tasks and a HumanoidJump task, ICPL achieves orders‑of‑magnitude reductions in human queries and competitive performance relative to reward‑design baselines, including ground‑truth sparse rewards. The work demonstrates that LLMs can serve as few-shot preference learners to structure reward signals, enabling scalable, human‑in‑the‑loop PbRL for complex, subjective tasks.

Abstract

Preference-based reinforcement learning is an effective way to handle tasks where rewards are hard to specify, but it can be exceedingly inefficient because preference learning is often tabula rasa. We demonstrate that Large Language Models (LLMs) have native preference-learning capabilities that allow them to achieve sample-efficient preference learning, addressing this challenge. We propose In-Context Preference Learning (ICPL), which uses the in-context learning capabilities of LLMs to reduce human query inefficiency. ICPL uses the task description and basic environment code to create sets of reward functions, which are iteratively refined by placing human feedback over videos of the resultant policies into the context of an LLM and then requesting better rewards. We first demonstrate ICPL's effectiveness through a synthetic preference study, providing quantitative evidence that it significantly outperforms baseline preference-based methods with much higher performance and orders of magnitude greater efficiency. We observe that these improvements do not come solely from LLM grounding in the task; the quality of the rewards also improves over time, indicating preference-learning capabilities. Additionally, we perform a series of real human preference-learning trials and observe that ICPL extends beyond synthetic settings and can work effectively with humans in the loop.
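The iterative loop described in the abstract can be illustrated with a toy sketch. This is a minimal sketch under stated assumptions, not the authors' implementation: `propose_rewards` is a hypothetical stand-in for the LLM call, `train_and_score` is a hypothetical stand-in for RL training and evaluation, a "reward function" is reduced to a single number, and `TARGET_WEIGHT` is an invented optimum. Only the control flow (propose K candidates, train, pick the most and least preferred, feed both back) follows the description in the text.

```python
import random
from typing import List, Optional, Tuple

# Toy assumption: a "reward function" is one weight, and the task score of
# the policy trained on it peaks at TARGET_WEIGHT. Neither is part of ICPL.
TARGET_WEIGHT = 0.7

Feedback = Tuple[float, float]  # (most preferred, least preferred)


def propose_rewards(feedback: Optional[Feedback], k: int,
                    rng: random.Random) -> List[float]:
    """Hypothetical stand-in for the LLM call that writes K reward functions."""
    if feedback is None:
        # First iteration: only the task description is available.
        return [rng.uniform(0.0, 1.0) for _ in range(k)]
    best, worst = feedback
    # Later iterations: sample near the preferred candidate.
    step = 0.5 * abs(best - worst) + 0.01
    return [best + rng.uniform(-step, step) for _ in range(k)]


def train_and_score(weight: float) -> float:
    """Hypothetical stand-in for RL training plus rollout evaluation."""
    return -abs(weight - TARGET_WEIGHT)


def proxy_preference(scores: List[float]) -> Tuple[int, int]:
    """Proxy evaluator: indices of the best and worst candidates by task score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return best, worst


def icpl_loop(iterations: int = 5, k: int = 6, seed: int = 0) -> List[float]:
    """Run the propose/train/prefer loop; return the best score seen so far
    after each iteration."""
    rng = random.Random(seed)
    feedback: Optional[Feedback] = None
    best_so_far = float("-inf")
    history: List[float] = []
    for _ in range(iterations):
        candidates = propose_rewards(feedback, k, rng)
        scores = [train_and_score(w) for w in candidates]
        best, worst = proxy_preference(scores)
        # The preferred and dispreferred examples become the next prompt.
        feedback = (candidates[best], candidates[worst])
        best_so_far = max(best_so_far, scores[best])
        history.append(best_so_far)
    return history


if __name__ == "__main__":
    print(icpl_loop())
```

In a human-in-the-loop setting, `proxy_preference` would be replaced by an evaluator choosing among videos of the trained policies; the rest of the loop would keep the same shape.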

Paper Structure

This paper contains 35 sections, 3 equations, 4 figures, 9 tables, 3 algorithms.

Figures (4)

  • Figure 1: ICPL employs the LLM to generate an initial set of $K$ executable reward functions based on the task description and environment context. Agents are then trained with these reward functions using RL. Videos of the resulting agent behavior are generated, from which human evaluators select the most and least preferred. These selections serve as examples of positive and negative preferences. The preferences, along with additional contextual information, are provided as feedback prompts to the LLM, which is then asked to synthesize a new set of reward functions. For experiments simulating human evaluators, task scores are used to determine the best and worst reward functions.
  • Figure 2: Average improvement of the Reward Task Score (RTS) over successive iterations relative to the first iteration in ICPL for the Ant and ShadowHand tasks, demonstrating the method's effectiveness in refining reward functions.
  • Figure 3: A common behavior.
  • Figure 4: In one trial of the human-in-the-loop experiments, the humanoid learns a human-like jump by bending its legs and lowering its upper body to shift the center of mass. Note that both legs are used to jump and the agent bends at the hips.