PrefCLM: Enhancing Preference-based Reinforcement Learning with Crowdsourced Large Language Models

Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Ike Obi, Byung-Cheol Min

TL;DR

PrefCLM tackles the reward-engineering bottleneck in preference-based reinforcement learning (PbRL) by crowdsourcing synthetic feedback from multiple LLMs. It fuses their diverse evaluations with Dempster–Shafer Theory and adds a human-in-the-loop (HITL) pipeline to tailor robot behavior to individual users in human-robot interaction (HRI) tasks. The approach achieves competitive performance against expert-tuned scripted teachers across general RL tasks and improves user satisfaction and personalization in a real-world feeding scenario. The framework is intended as a plug-and-play enhancement for PbRL, enabling flexible, scalable, and human-aligned robot learning without extensive reward engineering.

Abstract

Preference-based reinforcement learning (PbRL) is emerging as a promising approach to teaching robots through human comparative feedback, sidestepping the need for complex reward engineering. However, the substantial volume of feedback required in existing PbRL methods often leads to reliance on synthetic feedback generated by scripted teachers. This approach necessitates intricate reward engineering again and struggles to adapt to the nuanced preferences particular to human-robot interaction (HRI) scenarios, where users may have unique expectations toward the same task. To address these challenges, we introduce PrefCLM, a novel framework that utilizes crowdsourced large language models (LLMs) as simulated teachers in PbRL. We utilize Dempster-Shafer Theory to fuse individual preferences from multiple LLM agents at the score level, efficiently leveraging their diversity and collective intelligence. We also introduce a human-in-the-loop pipeline that facilitates collective refinements based on user interactive feedback. Experimental results across various general RL tasks show that PrefCLM achieves competitive performance compared to traditional scripted teachers and excels in facilitating more natural and efficient behaviors. A real-world user study (N=10) further demonstrates its capability to tailor robot behaviors to individual user preferences, significantly enhancing user satisfaction in HRI scenarios.

Paper Structure (35 sections, 12 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Conceptual illustration of the LLM-based crowdsourced evaluation. LLM instructors are symbolized by animal icons, with distinct species representing varied LLM architectures. These crowd instructors, through their unique evaluative criteria and reasoning, determine individual preferences for robot trajectories, which are fused to formulate a unified crowdsourced preference used in the PbRL process.
  • Figure 2: Overview of the PrefCLM framework. (a) Given task-specific contextual information and prompts, multiple code-based evaluation functions are sampled from crowd LLM agents (see the function sampling section). (b) A cosine similarity check module then filters the sampled evaluation functions, selecting those that align with few-shot expert preferences within a specified tolerance (optional; see the function filtering section). (c) Evaluative scores are continuously assigned by these selected evaluation functions to pairs of robot trajectories. These scores are aggregated through Dempster-Shafer Theory (DST) fusion to form crowdsourced preferences, which are used for reward learning in PbRL (see the preference fusion section). (d) Crowd LLM agents can also collectively adapt and refine their evaluation functions based on user interactive inputs given periodically in HRI scenarios (optional; see the HRI section).
  • Figure 3: Learning curves on general RL tasks, measured in episode returns for locomotion tasks and success rates for manipulation tasks. The solid line represents the mean, while the shaded area indicates the standard deviation across five runs.
  • Figure 4: Locomotion behaviors learned by the Scripted Teachers (top) and PrefCLM (bottom) on the Cheetah Run task.
  • Figure 5: Results of ablation studies in terms of learning curves with a moving window average of 100 applied for readability. HO: homogeneous setting; HE: heterogeneous setting; N: number of LLM agents in the crowd.
  • ...and 3 more figures
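
The score-level fusion step described in the Figure 2 caption can be illustrated with a minimal sketch of Dempster's rule of combination. This is not the paper's implementation: the mapping from evaluation scores to mass functions (a two-way softmax discounted by a fixed `reliability` value), the function names, and the example scores are all assumptions made for illustration. The frame of discernment has two hypotheses, "A" (first trajectory preferred) and "B" (second trajectory preferred), with "AB" holding the mass left uncommitted.

```python
import math
from functools import reduce


def scores_to_mass(score_a, score_b, reliability=0.8):
    """Turn one agent's pair of scores into a mass function.

    Assumption (not from the paper): a softmax over the two scores,
    discounted by a fixed reliability; the remainder goes to "AB",
    i.e. "cannot tell which trajectory is better".
    """
    top = max(score_a, score_b)  # subtract the max for numerical stability
    exp_a = math.exp(score_a - top)
    exp_b = math.exp(score_b - top)
    p_a = exp_a / (exp_a + exp_b)
    return {
        "A": reliability * p_a,
        "B": reliability * (1.0 - p_a),
        "AB": 1.0 - reliability,
    }


def combine(m1, m2):
    """Dempster's rule of combination on the frame {A, B}."""
    # Conflict: one source commits to A while the other commits to B.
    conflict = m1["A"] * m2["B"] + m1["B"] * m2["A"]
    if conflict >= 1.0 - 1e-12:
        raise ValueError("Sources are in total conflict; rule is undefined.")
    norm = 1.0 - conflict
    return {
        "A": (m1["A"] * m2["A"] + m1["A"] * m2["AB"] + m1["AB"] * m2["A"]) / norm,
        "B": (m1["B"] * m2["B"] + m1["B"] * m2["AB"] + m1["AB"] * m2["B"]) / norm,
        "AB": (m1["AB"] * m2["AB"]) / norm,
    }


def fuse_preferences(score_pairs, reliability=0.8):
    """Fuse several agents' (score_a, score_b) pairs into one mass function."""
    masses = [scores_to_mass(a, b, reliability) for a, b in score_pairs]
    return reduce(combine, masses)


if __name__ == "__main__":
    # Hypothetical scores from three agents for one trajectory pair.
    fused = fuse_preferences([(2.0, 1.0), (1.5, 1.0), (0.5, 1.0)])
    label = 0 if fused["A"] > fused["B"] else 1
    print(fused, "preferred trajectory index:", label)
```

In this sketch the fused masses always sum to one, and the mass remaining on "AB" shrinks as more agents are combined, which is one way to read how the crowd as a whole becomes more committed than any single agent. The resulting label, or the fused masses as soft labels, could then feed a standard PbRL reward-learning step.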