Improving Sample Efficiency of Reinforcement Learning with Background Knowledge from Large Language Models
Fuxiang Zhang, Junyou Li, Yi-Chen Li, Zongzhang Zhang, Yang Yu, Deheng Ye
TL;DR
Reinforcement learning often suffers from poor sample efficiency, especially in sparse-reward settings. The authors propose a framework that grounds large language models on a pre-collected dataset to extract environment-wide background knowledge, representing it as a potential function $\phi(s)$ and using potential-based reward shaping with $F(s,s')=\gamma\phi(s')-\phi(s)$ to accelerate downstream RL while preserving policy optimality ($Q'(s,a)=Q(s,a)-\phi(s)$). They instantiate three prompting variants (BK-Code, BK-Pref, and BK-Goal) to convert knowledge into actionable signals, and demonstrate significant sample-efficiency gains in Minigrid and Crafter. The work also shows generalization to unseen tasks and analyzes sensitivity to LLM choice and data quality. Because the knowledge is extracted once, offline, the LLM does not need to be queried during RL training, which improves practicality. Overall, the approach provides a reusable, environment-centric source of guidance that reduces online LLM querying and enhances scalability to larger or more diverse tasks.
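As a rough illustration of the shaping term $F(s,s')=\gamma\phi(s')-\phi(s)$, the sketch below adds it to a task reward on a toy grid. Everything here is an assumption made for illustration: states are `(x, y)` tuples, and `manhattan_potential` is a hand-written stand-in for a potential that the framework would instead obtain from an LLM. The names `shaping_reward` and `shaped_reward` are hypothetical and are not taken from the paper.

```python
from typing import Callable, Tuple

State = Tuple[int, int]
Potential = Callable[[State], float]


def manhattan_potential(goal: State) -> Potential:
    """Hypothetical stand-in for an LLM-derived potential:
    states closer to `goal` get a higher (less negative) value."""
    def phi(s: State) -> float:
        return -float(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
    return phi


def shaping_reward(phi: Potential, s: State, s_next: State, gamma: float) -> float:
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def shaped_reward(task_reward: float, phi: Potential,
                  s: State, s_next: State, gamma: float) -> float:
    """Reward seen by the learner: task reward plus the shaping term."""
    return task_reward + shaping_reward(phi, s, s_next, gamma)


if __name__ == "__main__":
    phi = manhattan_potential(goal=(2, 0))
    # One step toward the goal with a sparse task reward of 0.
    print(shaped_reward(0.0, phi, s=(0, 0), s_next=(1, 0), gamma=0.9))
```

In this toy setting a step toward the goal receives a positive shaped reward even though the task reward is zero, which is the kind of dense signal the shaping is meant to supply in sparse-reward tasks.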
Abstract
Low sample efficiency is an enduring challenge of reinforcement learning (RL). With the advent of versatile large language models (LLMs), recent works impart common-sense knowledge to accelerate policy learning for RL processes. However, we note that such guidance is often tailored to one specific task and does not generalize to others. In this paper, we introduce a framework that harnesses LLMs to extract background knowledge of an environment, which contains a general understanding of the entire environment, so that various downstream RL tasks benefit from a one-time knowledge representation. We ground LLMs by feeding them a few pre-collected experiences and requesting them to delineate background knowledge of the environment. Afterward, we represent the output knowledge as potential functions for potential-based reward shaping, which preserves the optimal policy under the original task rewards. We instantiate three variants to prompt LLMs for background knowledge: writing code, annotating preferences, and assigning goals. Our experiments show that these methods achieve significant sample efficiency improvements in a spectrum of downstream tasks from the Minigrid and Crafter domains.
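The policy-optimality property can be seen from the standard telescoping argument for potential-based shaping, sketched here for a shaping term of the form $F(s,s')=\gamma\phi(s')-\phi(s)$. The trajectory notation $s_0,\dots,s_T$ is introduced only for this sketch. Along any trajectory, the discounted shaping terms collapse to boundary terms:

$$
\sum_{t=0}^{T-1}\gamma^{t}F(s_t,s_{t+1})
=\sum_{t=0}^{T-1}\bigl(\gamma^{t+1}\phi(s_{t+1})-\gamma^{t}\phi(s_t)\bigr)
=\gamma^{T}\phi(s_T)-\phi(s_0).
$$

Assuming the final term vanishes (for example, a bounded $\phi$ with $\gamma<1$ as $T\to\infty$, or a potential of zero at terminal states), the shaped return from $s_0$ differs from the original return only by $-\phi(s_0)$, which does not depend on the actions taken. The optimal action-values therefore satisfy $Q'(s,a)=Q(s,a)-\phi(s)$, and $\arg\max_a Q'(s,a)=\arg\max_a Q(s,a)$ in every state.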
