Learning Reward for Robot Skills Using Large Language Models via Self-Alignment

Yuwei Zeng, Yao Mu, Lin Shao

TL;DR

The paper tackles the bottleneck of reward design for robotic skill learning by leveraging Large Language Models (LLMs) to propose reward features and parameterization, and then grounding these proposals through a self-alignment loop that aligns LLM-based trajectory rankings with environment-driven execution feedback. The method adopts a bi-level optimization: an inner loop optimizes the policy under the current reward, while an outer loop updates the reward parameters via ranking-based Bayesian updates (Bradley–Terry with Boltzmann rationality) and active LLM-driven refinements when discrepancies arise. Empirically, the approach is validated on 9 tasks across 2 simulation environments, achieving near-oracle performance on several ManiSkill2 tasks, faster convergence, and substantial reductions in GPT-token usage compared to a mutation-based baseline. The results demonstrate that LLM-guided reward design, coupled with self-alignment, can reduce human supervision while enhancing training efficacy and efficiency for a broad set of robotic manipulation skills. This framework has practical significance for scalable, autonomous reward design in robotics and could accelerate development of diverse, robust policies with limited human input.
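
To make the outer-loop update concrete, here is a minimal sketch of a Bradley-Terry preference likelihood with a Boltzmann rationality coefficient, fitted by a simple MAP gradient ascent. It is an illustration under stated assumptions, not the authors' implementation: the linear reward over trajectory features, the Gaussian prior, and the names `trajectory_return`, `preference_prob`, `map_update`, `beta`, `lr`, and `prior_scale` are all hypothetical, and the paper's actual Bayesian update may differ.

```python
import math


def trajectory_return(theta, features):
    """Assumed linear reward: dot product of parameters and trajectory features."""
    return sum(t * f for t, f in zip(theta, features))


def preference_prob(theta, feat_a, feat_b, beta=1.0):
    """Bradley-Terry probability that trajectory a is ranked above trajectory b.

    Equals exp(beta * R_a) / (exp(beta * R_a) + exp(beta * R_b)), computed as a
    numerically stable sigmoid of the scaled return difference.
    """
    diff = beta * (trajectory_return(theta, feat_a) - trajectory_return(theta, feat_b))
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def map_update(theta, preferences, beta=1.0, lr=0.1, prior_scale=10.0, steps=200):
    """Gradient ascent on the log-posterior of the reward parameters.

    `preferences` is a list of (preferred_features, other_features) pairs.
    The prior is an assumed Gaussian centred on the initial parameters.
    """
    theta0 = list(theta)
    theta = list(theta)
    for _ in range(steps):
        # Gradient of the Gaussian log-prior.
        grad = [-(t - t0) / prior_scale ** 2 for t, t0 in zip(theta, theta0)]
        # Gradient of the Bradley-Terry log-likelihood for each ranked pair.
        for feat_win, feat_lose in preferences:
            p = preference_prob(theta, feat_win, feat_lose, beta)
            for k in range(len(theta)):
                grad[k] += beta * (1.0 - p) * (feat_win[k] - feat_lose[k])
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


# Usage: one ranked pair in which the first trajectory is preferred.
initial_theta = [0.0, 0.0]
ranked_pairs = [([1.0, 0.0], [0.0, 1.0])]
updated_theta = map_update(initial_theta, ranked_pairs)
```

In this toy setup the update shifts weight toward the feature that distinguishes the preferred trajectory, so the learned reward ranks the pair the same way as the supplied preference.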

Abstract

Learning reward functions remains the bottleneck in equipping a robot with a broad repertoire of skills. Large Language Models (LLMs) contain valuable task-related knowledge that can potentially aid in the learning of reward functions. However, the proposed reward function can be imprecise and thus ineffective, so it needs to be further grounded with environment information. We propose a method to learn rewards more efficiently in the absence of humans. Our approach consists of two components: we first use the LLM to propose features and a parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes the ranking inconsistency between the LLM and the learnt reward functions based on the execution feedback. The method was validated on 9 tasks across 2 simulation environments. It demonstrates a consistent improvement in training efficacy and efficiency, while consuming significantly fewer GPT tokens than the alternative mutation-based method.
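
The notion of ranking inconsistency can be illustrated by counting the trajectory pairs that the LLM and the learnt reward order differently. The sketch below uses a Kendall-tau-style discordant-pair fraction as a stand-in; this section does not give the paper's exact measure, so the metric and the names `discordant_pairs` and `ranking_inconsistency` are assumptions made for illustration.

```python
from itertools import combinations


def discordant_pairs(llm_ranking, reward_scores):
    """Return (better, worse) index pairs where the reward disagrees with the LLM.

    `llm_ranking` lists trajectory indices from best to worst according to the LLM.
    `reward_scores[i]` is the return of trajectory i under the learnt reward.
    """
    pairs = []
    # combinations() preserves list order, so `better` always precedes `worse`.
    for better, worse in combinations(llm_ranking, 2):
        if reward_scores[better] < reward_scores[worse]:
            pairs.append((better, worse))
    return pairs


def ranking_inconsistency(llm_ranking, reward_scores):
    """Fraction of trajectory pairs ordered differently by the LLM and the reward."""
    n = len(llm_ranking)
    total = n * (n - 1) // 2
    if total == 0:
        return 0.0
    return len(discordant_pairs(llm_ranking, reward_scores)) / total


# Usage: the LLM ranks trajectory 2 best, then 0, then 1.
llm_ranking = [2, 0, 1]
reward_scores = [0.5, 0.9, 1.0]
mismatches = discordant_pairs(llm_ranking, reward_scores)
inconsistency = ranking_inconsistency(llm_ranking, reward_scores)
```

The discordant pairs are the natural input for a parameter update: each one is a preference that the current reward violates.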

Paper Structure

This paper contains 42 sections, 2 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: The overview of our method. We learn the reward function using LLM with a bi-level optimization structure. We first use the LLM to propose features and parameterization of the reward function. Next, we update the parameters of this proposed reward function through an iterative self-alignment process. In particular, this process minimizes the ranking inconsistency between the LLM and our learned reward functions based on the new observations.
  • Figure 2: Six evaluation tasks from ManiSkill2: PickCube, PickSingleYCB, PegInsertionSide, OpenCabinetDoor, OpenCabinetDrawer, PushChair.
  • Figure 3: Three Isaac Gym evaluation tasks: Franka Cabinet, Shadow Hand Open Door Outwards, Shadow Hand Open Scissor.
  • Figure 4: (a) Parameter updates over iterations for the PickCube task; to better visualize the early shift, the plot is truncated to 50 iterations. (b) Success rate of the policy trained with the actively adjusted reward function and with the final adjusted reward function. A policy trained only with the final learnt reward function may not achieve the same performance.
  • Figure 5: Success rates vs. exploration steps on 6 ManiSkill2 tasks with SAC. On 5 tasks, the updated reward produces policies with performance similar to those trained with the oracle reward. Compared to using a fixed reward function generated by the LLM, our approach consistently improves training, with a faster convergence rate and/or higher converged performance.
  • ...and 6 more figures