Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution

Zen Kit Heng, Zimeng Zhao, Tianhao Wu, Yuanfei Wang, Mingdong Wu, Yangang Wang, Hao Dong

TL;DR

This paper addresses the challenge of designing universal RL rewards with LLMs by introducing a Reward Observation Space (ROS) that is evolved through heuristic sampling. A state execution table (SET) and a disentangled ROS (ROS_st and ROS_op) enable more thorough yet efficient exploration, while a text-code reconciliation step aligns user task descriptions with expert success criteria via a separate LLM. The framework iteratively generates reward samples, evaluates them via a fitness function, and records successful configurations in memory to guide future iterations. Empirical results on Bi-dexterous Manipulation tasks show improved stability and performance over baselines like Eureka, with ablations highlighting the contributions of ROS memory and reconciliation. This approach advances universal LLM-driven reward design, offering a scalable pathway to automate reward design across diverse robotic tasks with minimal human intervention.
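
The generate, evaluate, and remember loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `evolve`, `toy_sample`, `toy_fitness`, the state names, and the threshold are hypothetical, and the LLM sampler and RL-based fitness evaluation are replaced by toy stand-ins. It is not the paper's implementation.

```python
import random


def evolve(sample_fn, fitness_fn, iterations=3, samples_per_iter=4, threshold=0.5):
    """Generate, score, and remember reward candidates over several iterations."""
    memory = []   # (score, candidate) pairs that met the threshold
    best = None   # best (score, candidate) seen so far
    for _ in range(iterations):
        # Snapshot memory so every candidate in a batch sees the same history.
        history = list(memory)
        candidates = [sample_fn(history) for _ in range(samples_per_iter)]
        for cand in candidates:
            score = fitness_fn(cand)
            if score >= threshold:
                memory.append((score, cand))
            if best is None or score > best[0]:
                best = (score, cand)
    return best, memory


# Toy stand-ins (assumptions): a "reward" is just a set of state names, and
# fitness is the fraction of hypothetical task-relevant states it covers.
STATES = ["object_pos", "hand_pos", "goal_pos", "object_vel"]
TARGET = {"object_pos", "goal_pos"}
rng = random.Random(0)


def toy_sample(history):
    # Reuse the states of the best remembered success, then add one state.
    base = set(max(history, key=lambda item: item[0])[1]) if history else set()
    base.add(rng.choice(STATES))
    return frozenset(base)


def toy_fitness(candidate):
    return len(candidate & TARGET) / len(TARGET)


best, memory = evolve(toy_sample, toy_fitness)
```

In a real setting, `sample_fn` would presumably wrap an LLM call that receives the remembered configurations in its prompt, and `fitness_fn` would train and evaluate an RL policy under the candidate reward.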

Abstract

Large Language Models (LLMs) are emerging as promising tools for automated reinforcement learning (RL) reward design, owing to their robust capabilities in commonsense reasoning and code generation. By engaging in dialogues with RL agents, LLMs construct a Reward Observation Space (ROS) by selecting relevant environment states and defining their internal operations. However, existing frameworks have not effectively leveraged historical exploration data or manual task descriptions to iteratively evolve this space. In this paper, we propose a novel heuristic framework that enhances LLM-driven reward design by evolving the ROS through a table-based exploration caching mechanism and a text-code reconciliation strategy. Our framework introduces a state execution table, which tracks the historical usage and success rates of environment states, overcoming the Markovian constraint typically found in LLM dialogues and facilitating more effective exploration. Furthermore, we reconcile user-provided task descriptions with expert-defined success criteria using structured prompts, ensuring alignment in reward design objectives. Comprehensive evaluations on benchmark RL tasks demonstrate the effectiveness and stability of the proposed framework. Code and video demos are available at jingjjjjjie.github.io/LLM2Reward.
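
As a rough illustration of the state execution table idea, the sketch below keeps per-state usage and success counts and renders them as text that could be placed in a prompt, so the history survives across otherwise stateless dialogue turns. The class names, fields, state names, and text layout are all hypothetical; the abstract does not specify the table's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class StateRecord:
    """Usage statistics for one environment state (hypothetical structure)."""
    used: int = 0        # number of reward samples that referenced this state
    succeeded: int = 0   # number of those samples judged successful

    @property
    def success_rate(self) -> float:
        return self.succeeded / self.used if self.used else 0.0


@dataclass
class StateExecutionTable:
    """Maps state names to their accumulated usage statistics."""
    records: dict = field(default_factory=dict)

    def update(self, states, success):
        # Record one evaluated reward sample and the states it used.
        for name in states:
            rec = self.records.setdefault(name, StateRecord())
            rec.used += 1
            rec.succeeded += int(success)

    def to_prompt(self) -> str:
        # Render the table as plain text for inclusion in an LLM prompt.
        lines = ["state | used | success_rate"]
        for name, rec in sorted(self.records.items()):
            lines.append(f"{name} | {rec.used} | {rec.success_rate:.2f}")
        return "\n".join(lines)


table = StateExecutionTable()
table.update(["object_pos", "hand_pos"], success=True)
table.update(["object_pos"], success=False)
print(table.to_prompt())
```

Keeping the statistics outside the dialogue and re-rendering them each turn is one simple way to give the LLM access to exploration history beyond its most recent exchange.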

Paper Structure

This paper contains 13 sections, 7 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Comparison of the evolutionary process. (a) Eureka's evaluation and sampling. (b) Our evaluation and sampling.
  • Figure 2: The pipeline of our proposed framework for heuristic Reward Observation Space (ROS) evolution in LLM-aided RL reward design. (a) User-expert Mission Reconciling. (b) Observation Space Disentanglement. (c) Reward State Execution. (d) Reward Item Performance.
  • Figure 3: Schematic illustration of how the sampling process over the reward space $\mathcal{R}$ differs between methods. Compared to Eureka (Ma et al., 2023), observation space disentanglement improves the efficiency of the LLM's sampling process by reducing the degrees of freedom.
  • Figure 4: Reward evolution in the first 3 iterations of our framework on the BlockGrasp task. (a) Each sample in iteration 2 keeps the same $\text{ROS}_{st}$ as $R_1^\star$. (b) Each sample in iteration 3 keeps a similar $\text{ROS}_{op}$ to $R_2^\star$. In each iteration, the abstract part with the highest $\mathcal{F}_{sc}(\cdot)$ (identified by the red box) and two additional executable rewards are shown.
  • Figure 5: Comparison with existing LLM reward design approaches. The subplots report the success rates on the 20 dexterity tasks of the Bi-dexterous Manipulation benchmark (Chen et al., 2022).