Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos

TL;DR

The paper tackles reward shaping in sparse-reward RL by learning intrinsic rewards online, at scale, from large language model (LLM) feedback. It introduces ONI, a distributed architecture that pairs asynchronous LLM annotations with a PPO-based agent to jointly learn the policy and an intrinsic reward model, eliminating the need for large pre-collected datasets. It explores three intrinsic-reward modeling methods (retrieval, classification, and ranking) that trade off simplicity, generalization, and semantic understanding. Empirically, ONI achieves state-of-the-art performance on sparse-reward tasks in the NetHack Learning Environment, often matching Motif without offline data, and remains robust across LLM sizes and ablation conditions. The work advances scalable, data-efficient RL by using online LLM feedback to shape intrinsic motivation, with practical implications for open-ended tasks and high-throughput training.

Abstract

Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples, due to requiring LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. Our approach achieves state-of-the-art performance across a range of challenging tasks from the NetHack Learning Environment, while removing the need for large offline datasets required by prior work. We make our code available at https://github.com/facebookresearch/oni.
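
To make the annotate-then-distill loop in the abstract concrete, here is a minimal sketch of a hash-table reward model fed by a background annotator. It is not the authors' implementation: the class `RetrievalRewardModel`, the helper `keyword_annotator` (a stand-in for a real LLM call), the binary labels, and the choice to return zero reward for captions that have not been annotated yet are all assumptions made for illustration.

```python
import queue
import threading
from typing import Callable, Dict, Set


def keyword_annotator(caption: str) -> int:
    """Hypothetical stand-in for an LLM call: label 1 if the caption mentions gold."""
    return int("gold" in caption.lower())


class RetrievalRewardModel:
    """Hash table from observation caption to a binary label (illustrative only)."""

    def __init__(self, annotate: Callable[[str], int]) -> None:
        self._annotate = annotate
        self._labels: Dict[str, int] = {}   # caption -> label in {0, 1}
        self._requested: Set[str] = set()   # captions already sent for annotation
        self._pending: "queue.Queue[str]" = queue.Queue()
        self._lock = threading.Lock()
        # The background thread plays the role of the asynchronous LLM server.
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self) -> None:
        while True:
            caption = self._pending.get()
            label = self._annotate(caption)
            with self._lock:
                self._labels[caption] = label
            self._pending.task_done()

    def reward(self, caption: str) -> float:
        """Return the stored label, or 0.0 while the caption is still unannotated."""
        with self._lock:
            if caption in self._labels:
                return float(self._labels[caption])
            if caption not in self._requested:
                self._requested.add(caption)
                self._pending.put(caption)  # never blocks the caller on the annotator
        return 0.0

    def wait_for_annotations(self) -> None:
        """Block until every queued caption is labeled (used here for the demo)."""
        self._pending.join()


model = RetrievalRewardModel(keyword_annotator)
first = model.reward("You see here 5 gold pieces.")   # not annotated yet
model.wait_for_annotations()
second = model.reward("You see here 5 gold pieces.")  # label now in the table
print(first, second)
```

The point of the design is that `reward` never waits on the annotator, so policy learning keeps its throughput while labels arrive in the background. A classification or ranking model would replace the exact-match table with a learned function that generalizes to unseen captions.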

Paper Structure

This paper contains 27 sections, 8 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Overview of Motif (Klissarov et al., 2023) and ONI.
  • Figure 2: Overall system diagram of ONI. Our additions to Sample Factory are highlighted in blue. We added an asynchronously executing LLM server and a learned reward function, and connected them back into the main learning process in a way that does not hurt the overall throughput of policy and value learning.
  • Figure 3: ONI-based methods match or closely track the performance of Motif without using a pre-collected dataset. This holds in both (a) reward-based and (b) reward-free settings. Motif's pre-collected dataset uses privileged information about dense reward functions to solve sparse-reward or reward-free environments, while ONI methods do not. ELLM-BoW also proves to be a competitive baseline here.
  • Figure 4: (a) ELLM-BoW is not able to understand the semantic meaning of complex goals, resulting in agents with similar behavior under the combat and the gold goals. (b) ONI-retrieval can distinguish the goals, and the resulting agents focus on different aspects of game progress.
  • Figure 5: Performance remains comparable despite doubling LLM annotation throughput. ONI-ranking's throughput is lower than the others due to annotating pairs of captions.
  • ...and 7 more figures