Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model

Songjun Tu, Jingbo Sun, Qichao Zhang, Xiangyuan Lan, Dongbin Zhao

TL;DR

Online preference-based reinforcement learning (PbRL) typically needs either real-time human feedback or a scripted teacher with access to privileged rewards, and both are often impractical. This work proposes RL-SaLLM-F, which replaces the scripted teacher with self-augmented feedback from an LLM: the LLM labels trajectory pairs, generates imagined trajectories that better achieve the task goal, and applies a double-check mechanism to make the preference labels more reliable, so the reward model is trained without privileged information. The method runs a three-stage loop (unsupervised pre-training with intrinsic rewards, LLM-based labeling with self-augmentation, and policy learning via SAC updates on relabeled rewards) and performs comparably to or better than scripted-teacher baselines on MetaWorld tasks using the lightweight GPT-4o-mini, with stronger results from GPT-4o at higher cost. The approach offers a practical online PbRL framework that uses LLM-driven feedback to reduce the annotation burden on robotic manipulation tasks.

Abstract

Preference-based reinforcement learning (PbRL) provides a powerful paradigm for avoiding meticulous reward engineering by learning rewards from human preferences. However, real-time human feedback is hard to obtain in online tasks. Most work assumes a "scripted teacher" that uses a privileged predefined reward to provide preference feedback. In this paper, we propose RL Self-augmented Large Language Model Feedback (RL-SaLLM-F), a technique for online PbRL that does not rely on privileged information. RL-SaLLM-F leverages the reflective and discriminative capabilities of LLMs to generate self-augmented trajectories and provide preference labels for reward learning. First, we identify a failure mode of LLM-based preference discrimination in online PbRL, namely "query ambiguity". Then, the LLM is employed to provide preference labels and to generate self-augmented imagined trajectories that better achieve the task goal, thereby enhancing the quality and efficiency of feedback. Additionally, a double-check mechanism is introduced to mitigate randomness in the preference labels, improving the reliability of LLM feedback. Experiments across multiple tasks in the MetaWorld benchmark demonstrate the specific contribution of each proposed module in RL-SaLLM-F and show that self-augmented LLM feedback can effectively replace the impractical "scripted teacher" feedback. In summary, RL-SaLLM-F introduces a new direction for feedback acquisition in online PbRL that does not rely on any online privileged information, offering an efficient and lightweight solution with LLM-driven feedback.
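
As an illustration of the double-check idea, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical `ask_llm` callable that receives two trajectory descriptions and returns 0 if the first is preferred or 1 if the second is preferred. The pair is queried in both orderings, and the label is kept only when the two answers agree. Discarding inconsistent pairs is an assumption made here for illustration; the abstract does not say how disagreements are handled.

```python
from typing import Callable, Optional

# Hypothetical judge interface (an assumption, not from the paper):
# returns 0 if the first description is preferred, 1 if the second is.
LLMJudge = Callable[[str, str], int]


def double_check_label(traj_a: str, traj_b: str, ask_llm: LLMJudge) -> Optional[int]:
    """Return 0 if traj_a is preferred, 1 if traj_b is, None if the answers disagree."""
    first = ask_llm(traj_a, traj_b)   # original ordering
    second = ask_llm(traj_b, traj_a)  # swapped ordering
    second_mapped = 1 - second        # map the swapped answer back to (a, b) indexing
    if first == second_mapped:
        return first
    return None  # assumption: inconsistent labels are discarded


# Stub judges standing in for a real LLM call.
def shorter_is_better(first: str, second: str) -> int:
    """Order-independent judge: prefers the shorter description."""
    return 0 if len(first) <= len(second) else 1


def always_first(first: str, second: str) -> int:
    """Position-biased judge: always prefers whatever is shown first."""
    return 0
```

A consistent judge such as `shorter_is_better` yields the same label in both orderings, while a position-biased judge such as `always_first` contradicts itself after the swap and is filtered out.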

Paper Structure

This paper contains 39 sections, 6 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Query ambiguities in online PbRL. (a) Two failed trajectories are converted into text form and input into the LLM, showing that the LLM struggles to evaluate such trajectories. Specific examples can be found in Appendix D; (b) Training curves of PEBBLE with LLM feedback. The blue line represents the labeling accuracy of the LLM and the red line represents the predefined episode rewards. (c) Training curves with the double-check mechanism and additional self-augmented LLM feedback.
  • Figure 2: The overall framework of RL-SaLLM-F. First, trajectories are sampled from the replay buffer and converted into coordinate text descriptions. Next, the text representations of the selected trajectory pairs are submitted to the LLM twice, with different orderings, to obtain feedback labels. Subsequently, based on the sampled trajectories, the LLM generates 'imagined' trajectories that better achieve the goal, which are used to train the reward model.
  • Figure 3: Learning curves of all compared methods on 8 tasks. Results are averaged over 5 seeds, and shaded regions represent standard error. RL-SaLLM-F masters robotic manipulation without any online privileged reward, performing on par with PEBBLE, which uses 'scripted teacher' feedback, and on some tasks even with SAC trained on predefined reward functions.
  • Figure 4: Learning curves of the ablation study. When any component of RL-SaLLM-F is removed, the performance decreases. Specifically, the absence of self-augmented feedback leads to a notably poor success rate.
  • Figure 5: Normalized step rewards for the expert and suboptimal trajectories in the Button Press task. The rewards of RL-SaLLM-F show better alignment with the predefined task reward than the ablation variants, particularly in the suboptimal trajectory.
  • ...and 8 more figures
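
The Figure 2 caption mentions converting sampled trajectories into coordinate text descriptions before querying the LLM. The sketch below shows one way such a conversion could look. The function name `trajectory_to_text`, the `stride` parameter, and the exact wording of the output are all hypothetical; the paper's actual prompt format is not given in this summary.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def trajectory_to_text(positions: Sequence[Point], goal: Point, stride: int = 1) -> str:
    """Render a trajectory of 3D positions as plain text, one line per kept step.

    The layout is an illustrative assumption, not the paper's prompt format.
    """
    lines = [f"Goal position: ({goal[0]:.2f}, {goal[1]:.2f}, {goal[2]:.2f})"]
    # Keep every `stride`-th step so that long trajectories stay short in the prompt.
    for t in range(0, len(positions), stride):
        x, y, z = positions[t]
        lines.append(f"Step {t}: gripper at ({x:.2f}, {y:.2f}, {z:.2f})")
    return "\n".join(lines)
```

Two such text blocks, one per trajectory in a pair, could then be passed to the LLM in both orderings to obtain a preference label.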