ICPL: Few-shot In-context Preference Learning via LLMs
Chao Yu, Qixin Tan, Hong Lu, Jiaxuan Gao, Xinting Yang, Yu Wang, Yi Wu, Eugene Vinitsky
TL;DR
ICPL tackles reward specification in preference-based reinforcement learning (PbRL) by leveraging large language models to generate executable reward functions in context and refine them through human preferences. It embeds an online loop in which an LLM proposes reward functions, reinforcement learning trains policies, and human or proxy feedback guides successive reward generation, assisted by automatic signals to enrich prompts. Across proxy and real-human experiments on IsaacGym tasks and a HumanoidJump task, ICPL achieves orders-of-magnitude reductions in human queries and competitive performance relative to reward-design baselines, including ground-truth sparse rewards. The work demonstrates that LLMs can serve as few-shot preference learners to structure reward signals, enabling scalable, human-in-the-loop PbRL for complex, subjective tasks.
Abstract
Preference-based reinforcement learning is an effective way to handle tasks where rewards are hard to specify, but it can be exceedingly inefficient because preference learning often starts tabula rasa. We demonstrate that Large Language Models (LLMs) have native preference-learning capabilities that allow them to achieve sample-efficient preference learning, addressing this challenge. We propose In-Context Preference Learning (ICPL), which uses the in-context learning capabilities of LLMs to reduce human query inefficiency. ICPL uses the task description and basic environment code to create sets of reward functions, which are iteratively refined by placing human feedback over videos of the resultant policies into the context of an LLM and then requesting better rewards. We first demonstrate ICPL's effectiveness through a synthetic preference study, providing quantitative evidence that it achieves much higher performance than baseline preference-based methods while being orders of magnitude more efficient. We observe that these improvements do not come solely from LLM grounding in the task; the quality of the rewards also improves over time, indicating preference-learning capabilities. Additionally, we perform a series of real human preference-learning trials and observe that ICPL extends beyond synthetic settings and can work effectively with humans in the loop.
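The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name icpl_loop, its parameters, the prompt wording, and the toy stand-ins for the LLM, the RL training step, and the human rater are all hypothetical. In particular, the sketch assumes feedback is a single "preferred candidate" index per round, and it replaces executable reward code with integer strings so that the example runs without an LLM or a simulator.

```python
from typing import Callable, List


def icpl_loop(task: str,
              propose: Callable[[str, int], List[str]],
              train_and_render: Callable[[str], str],
              prefer: Callable[[List[str]], int],
              iterations: int = 3,
              k: int = 4) -> str:
    """Hypothetical sketch of an in-context preference loop."""
    context = task  # prompt starts from the task description alone
    best = ""
    for _ in range(iterations):
        # 1. The (stand-in) LLM proposes k candidate reward functions.
        candidates = propose(context, k)
        # 2. Each candidate trains a policy whose behavior is rendered.
        videos = [train_and_render(reward) for reward in candidates]
        # 3. A human or proxy rater picks the preferred rendering.
        best = candidates[prefer(videos)]
        # 4. The preference is placed in context for the next round.
        context = (f"{task}\nPreferred reward so far:\n{best}\n"
                   "Propose better rewards.")
    return best


# --- Toy stand-ins (assumptions for illustration only) ---
TARGET = 10  # the behavior the toy rater secretly wants


def toy_propose(context: str, k: int) -> List[str]:
    # Reads the last preferred value back out of the prompt, if present,
    # and proposes k values starting from it.
    lines = context.splitlines()
    base = int(lines[2]) if len(lines) > 2 else 0
    return [str(base + i) for i in range(k)]


def toy_train_and_render(reward: str) -> str:
    # Pretend the trained policy's "video" simply shows the reward value.
    return reward


def toy_prefer(videos: List[str]) -> int:
    # Proxy rater: prefers the rendering closest to TARGET.
    return min(range(len(videos)), key=lambda i: abs(int(videos[i]) - TARGET))


result = icpl_loop("Reach the target value",
                   toy_propose, toy_train_and_render, toy_prefer)
print(result)
```

In this toy run the preferred candidate moves toward the rater's hidden target with each round, which mirrors the abstract's observation that reward quality improves over iterations rather than being fixed by the initial prompt. In a real setting, propose would call an LLM with the task description and environment code, and train_and_render would run RL training and record a video.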
