Automated Rewards via LLM-Generated Progress Functions

Vishnu Sarukkai, Brennan Shacklett, Zander Majercik, Kush Bhatia, Christopher Ré, Kayvon Fatahalian

TL;DR

This paper introduces an LLM-driven reward generation framework that is able to produce state-of-the-art policies on the challenging Bi-DexHands benchmark with 20x fewer reward function samples than the prior state-of-the-art work.

Abstract

Large Language Models (LLMs) have the potential to automate reward engineering by leveraging their broad domain knowledge across various tasks. However, they often need many iterations of trial-and-error to generate effective reward functions. This process is costly because evaluating each sampled reward function requires running a full policy optimization. In this paper, we introduce an LLM-driven reward generation framework that is able to produce state-of-the-art policies on the challenging Bi-DexHands benchmark with 20x fewer reward function samples than the prior state-of-the-art work. Our key insight is that we reduce the problem of generating task-specific rewards to the problem of coarsely estimating task progress. Our two-step solution first leverages the task domain knowledge and the code synthesis abilities of LLMs to author progress functions that estimate task progress from a given state. Then, we use this notion of progress to discretize states and generate count-based intrinsic rewards over the resulting low-dimensional state space. We show that the combination of LLM-generated progress functions and count-based intrinsic rewards is essential for our performance gains, while alternatives such as generic hash-based counts or using progress directly as a reward function fall short.
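
The two-step recipe in the abstract (estimate progress, then count visits to discretized progress bins) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the hand-written `progress_fn` stands in for an LLM-authored progress function, the state fields (`object_pos`, `goal_pos`, `max_dist`) are hypothetical, uniform binning stands in for the paper's discretization heuristics, and the bonus `beta / sqrt(N)` is one common count-based form rather than a confirmed choice of the authors.

```python
import math
from collections import Counter


def progress_fn(state: dict) -> float:
    """Hypothetical stand-in for an LLM-authored progress function.

    Maps a state to a coarse progress estimate in [0, 1], where 1 means
    the object has reached the goal. The state fields are assumptions.
    """
    dist = abs(state["object_pos"] - state["goal_pos"])
    return max(0.0, 1.0 - dist / state["max_dist"])


def discretize(progress: float, num_bins: int = 10) -> int:
    """Map a progress value to one of num_bins uniform bins (assumed scheme)."""
    clipped = min(max(progress, 0.0), 1.0)
    # progress == 1.0 would land in bin num_bins, so clamp to the last bin.
    return min(int(clipped * num_bins), num_bins - 1)


class CountBonus:
    """Count-based intrinsic reward over discretized progress bins."""

    def __init__(self, beta: float = 1.0, num_bins: int = 10):
        self.beta = beta
        self.num_bins = num_bins
        self.counts = Counter()  # visitation count per progress bin

    def reward(self, state: dict) -> float:
        bin_id = discretize(progress_fn(state), self.num_bins)
        self.counts[bin_id] += 1
        # Rarely visited progress levels earn a larger bonus.
        return self.beta / math.sqrt(self.counts[bin_id])


bonus = CountBonus()
far = {"object_pos": 0.0, "goal_pos": 1.0, "max_dist": 1.0}
near = {"object_pos": 0.5, "goal_pos": 1.0, "max_dist": 1.0}
for s in (far, far, near):
    print(round(bonus.reward(s), 3))
```

In this sketch, repeated visits to the same progress bin yield a shrinking bonus, while the first visit to a new progress level earns the full bonus again. This is the exploration pressure that counting over a low-dimensional progress space is meant to provide.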

Paper Structure

This paper contains 46 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: ProgressCounts: an algorithm for reward generation via LLM-generated task progress functions and count-based rewards. (A) We leverage an LLM to generate code for a progress function, which distills task-specific features from a high-dimensional state space into a low-dimensional notion of task progress. The LLM takes as input a high-level task description, a small library of feature engineering functions, and a description of the environment state space. On a per-task basis, the user only needs to provide the task description. (B) We use heuristics to discretize the output of the LLM-generated progress function, compute state visitation counts across the discretized bins, and leverage standard count-based rewards to learn RL policies.
  • Figure 2: On the Bi-DexHands benchmark, ProgressCounts produces policies that perform comparably to those of Eureka in terms of average task success rate, at a much smaller sample budget. Eureka's evolutionary algorithm requires $48$ policy samples (training runs with different generated reward functions) to find a policy whose performance matches that of human-designed dense reward functions. ProgressCounts requires only four policy samples (different progress functions) to generate a policy that outperforms the human-designed baseline and exceeds the peak performance achieved by Eureka after $80$ policy samples ($20\times$ the cost of ProgressCounts).
  • Figure 3: ProgressCounts produces policies whose performance (in terms of task success rate) matches or exceeds that of the prior state-of-the-art method (Eureka) on 13 of 20 tasks in the Bi-DexHands benchmark. Policies trained with sparse rewards (Sparse) struggle on most Bi-DexHands tasks. See Appendix \ref{appendix:bidex} for results in tabular form.
  • Figure 4: By allocating many environment samples to a single training run, ProgressCounts trains a policy that achieves high success on the challenging TwoCatchUnderarm task. All baselines achieve zero success on this task given a budget of two billion environment samples.
  • Figure 5: Training curves for ProgressCounts on 8 hard-exploration MiniGrid tasks.