Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos
TL;DR
The paper tackles reward shaping for sparse-reward RL by learning intrinsic rewards online, at scale, from large language model (LLM) feedback. It introduces ONI, a distributed architecture that pairs asynchronous LLM annotation with a PPO-based agent to jointly learn the policy and an intrinsic reward model, eliminating the need for large pre-collected datasets. Three intrinsic-reward modeling methods are explored (hashing-based retrieval, classification, and ranking), offering trade-offs between simplicity, generalization, and semantic understanding. Empirically, ONI achieves state-of-the-art performance on sparse-reward tasks in the NetHack Learning Environment, often matching Motif without using offline data, and remains robust across different LLM sizes and ablation conditions. The work advances scalable, data-efficient RL by using online LLM feedback to shape intrinsic motivation, with practical implications for open-ended tasks and high-throughput training.
Abstract
Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made encouraging steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: either they do not scale to problems requiring billions of environment samples, because they require an LLM annotation for each observation, or they require a diverse offline dataset, which may not exist or may be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, and the resulting annotations are then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. Our approach achieves state-of-the-art performance across a range of challenging tasks from the NetHack Learning Environment, while removing the need for the large offline datasets required by prior work. We make our code available at https://github.com/facebookresearch/oni.
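To make the asynchronous annotation idea concrete, the following is a minimal sketch of the simplest variant named above, a hashing-based reward model. It is not the authors' implementation. The class `RetrievalRewardModel`, the function `stub_llm_label`, and the choice to return a reward of 0 for observations that have not been annotated yet are all assumptions made for illustration; the stub labeler stands in for a real LLM server.

```python
import queue
import threading


def stub_llm_label(caption: str) -> int:
    """Hypothetical stand-in for an LLM call: 1 if the caption looks helpful, else 0."""
    return int("gold" in caption or "door" in caption)


class RetrievalRewardModel:
    """Hash table from observation caption to a binary label (illustrative only)."""

    def __init__(self, labeler):
        self._labeler = labeler
        self._labels = {}               # captions that have been annotated
        self._seen = set()              # captions already queued, to avoid duplicates
        self._pending = queue.Queue()   # captions waiting for annotation
        self._lock = threading.Lock()
        worker = threading.Thread(target=self._annotate_forever, daemon=True)
        worker.start()

    def _annotate_forever(self):
        # Background loop: label queued captions without blocking the agent.
        while True:
            caption = self._pending.get()
            label = self._labeler(caption)
            with self._lock:
                self._labels[caption] = label
            self._pending.task_done()

    def reward(self, caption: str) -> float:
        """Return the stored label, or 0.0 (assumed default) if not yet annotated."""
        with self._lock:
            if caption in self._labels:
                return float(self._labels[caption])
            if caption not in self._seen:
                self._seen.add(caption)
                self._pending.put(caption)
        return 0.0

    def wait_for_annotations(self):
        """Block until every queued caption has been labeled (for demos and tests)."""
        self._pending.join()


model = RetrievalRewardModel(stub_llm_label)
first = model.reward("You see a door.")    # 0.0: not annotated yet
model.wait_for_annotations()
second = model.reward("You see a door.")   # 1.0 under the stub labeler
```

In this sketch the agent never waits on the labeler: an unseen caption is queued once and yields a zero reward until its label arrives. A lookup table of this kind cannot generalize to captions it has not seen, which is the trade-off that learned classification and ranking models are meant to address.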
