Improving Sample Efficiency of Reinforcement Learning with Background Knowledge from Large Language Models

Fuxiang Zhang, Junyou Li, Yi-Chen Li, Zongzhang Zhang, Yang Yu, Deheng Ye

TL;DR

Reinforcement learning often suffers from poor sample efficiency, especially in sparse-reward settings. The authors propose a framework that grounds large language models on a pre-collected dataset to extract environment-wide background knowledge, represents that knowledge as a potential function $\phi(s)$, and applies potential-based reward shaping with $F(s,s')=\gamma\phi(s')-\phi(s)$ to accelerate downstream RL. Shaping of this form preserves policy optimality, because the optimal Q-function only shifts by a state-dependent term: $Q'(s,a)=Q(s,a)-\phi(s)$. They instantiate three prompting variants (BK-Code, BK-Pref, and BK-Goal) to convert knowledge into shaping signals, and report significant sample-efficiency gains in Minigrid and Crafter. The work also shows generalization to unseen tasks and analyzes sensitivity to LLM choice and data quality. The LLM is queried offline rather than during RL training, which improves practicality. Overall, the approach provides a reusable, environment-centric source of guidance that reduces online LLM querying and enhances scalability to larger or more diverse tasks.
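The shaping term described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the toy potential table are hypothetical, chosen only to show how $F(s,s')=\gamma\phi(s')-\phi(s)$ is added to the task reward and why its discounted sum along a trajectory telescopes to a quantity that depends only on the first and last states.

```python
from typing import Callable, Hashable, Sequence

State = Hashable


def shaping_term(phi: Callable[[State], float], s: State, s_next: State,
                 gamma: float) -> float:
    """F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def shaped_reward(task_reward: float, phi: Callable[[State], float],
                  s: State, s_next: State, gamma: float) -> float:
    """Task reward plus the potential-based shaping term."""
    return task_reward + shaping_term(phi, s, s_next, gamma)


def discounted_shaping_sum(phi: Callable[[State], float],
                           states: Sequence[State], gamma: float) -> float:
    """Sum of gamma^t * F(s_t, s_{t+1}) over consecutive states."""
    total = 0.0
    for t in range(len(states) - 1):
        total += gamma ** t * shaping_term(phi, states[t], states[t + 1], gamma)
    return total


# Hypothetical potential table for a four-state chain (illustration only).
toy_potential = {0: 0.0, 1: 0.2, 2: 0.5, 3: 1.0}
phi = toy_potential.__getitem__
gamma = 0.9
trajectory = [0, 1, 2, 3]

# The intermediate potentials cancel, leaving gamma^T * phi(s_T) - phi(s_0).
telescoped = gamma ** 3 * phi(3) - phi(0)
assert abs(discounted_shaping_sum(phi, trajectory, gamma) - telescoped) < 1e-9
```

Because the intermediate terms cancel, the shaping changes how quickly reward information reaches the agent without changing which policy is optimal.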

Abstract

Low sample efficiency is an enduring challenge of reinforcement learning (RL). With the advent of versatile large language models (LLMs), recent works impart common-sense knowledge to accelerate policy learning for RL processes. However, we note that such guidance is often tailored for one specific task but loses generalizability. In this paper, we introduce a framework that harnesses LLMs to extract background knowledge of an environment, which contains general understandings of the entire environment, making various downstream RL tasks benefit from one-time knowledge representation. We ground LLMs by feeding a few pre-collected experiences and requesting them to delineate background knowledge of the environment. Afterward, we represent the output knowledge as potential functions for potential-based reward shaping, which has a good property for maintaining policy optimality from task rewards. We instantiate three variants to prompt LLMs for background knowledge, including writing code, annotating preferences, and assigning goals. Our experiments show that these methods achieve significant sample efficiency improvements in a spectrum of downstream tasks from Minigrid and Crafter domains.

Paper Structure (17 sections, 8 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: An illustration of our framework to extract background knowledge from LLMs for reward shaping in downstream RL tasks. We sample experiences from pre-collected data and request LLM feedback in different forms including code, preference, or goals. The obtained feedback is represented as potential functions for potential-based reward shaping in downstream RL tasks.
  • Figure 2: The proposed three variants of background knowledge representation from pre-collected data. (a) We query an LLM to write code that returns high values for behaviors with desired background knowledge. We ask the LLM to iteratively improve the code from sampled results. (b) We prompt an LLM to annotate its preference over two trajectories and then learn the potential function $\phi(s)$ that decomposes preferences. (c) We sample trajectories from the dataset and ask the LLM to suggest potential goals. The pair of captions and goals are stored in a text-based goal library. To use the goal library for downstream RL, we retrieve results whose trajectories are similar to agent history and compute goal similarity with the current state.
  • Figure 3: Average episodic returns of compared methods in different BabyAI goto tasks of the Minigrid environment. Task goals containing the color purple and the object type key do not appear in the collected datasets.
  • Figure 4: Average success rates of compared methods in different downstream tasks of the Crafter environment. For each task, the agent only acquires a reward when completing the corresponding achievement.
  • Figure 5: Rendered game frames from two used environments: (a) Minigrid and (b) Crafter.
  • ...and 8 more figures
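
The preference variant in Figure 2(b) learns a potential function from pairwise trajectory preferences. One common way to do this is sketched below in Python. It assumes a Bradley-Terry model in which a trajectory's score is the sum of $\phi(s)$ over its states; the captions do not specify the loss the paper uses, so this model, the tabular $\phi$, and the toy state names are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Tuple

Trajectory = Sequence[str]


def traj_score(phi: DefaultDict[str, float], traj: Trajectory) -> float:
    """Score of a trajectory: the sum of potentials of its states."""
    return sum(phi[s] for s in traj)


def pref_prob(phi: DefaultDict[str, float], a: Trajectory, b: Trajectory) -> float:
    """P(a preferred over b) under the assumed Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(traj_score(phi, b) - traj_score(phi, a)))


def fit_potential(preferences: List[Tuple[Trajectory, Trajectory]],
                  lr: float = 0.1, epochs: int = 200) -> DefaultDict[str, float]:
    """Fit a tabular phi by gradient ascent on the preference log-likelihood.

    Each preference is a (preferred, rejected) pair of trajectories.
    """
    phi: DefaultDict[str, float] = defaultdict(float)
    for _ in range(epochs):
        for winner, loser in preferences:
            # Gradient of log P(winner > loser) w.r.t. the score difference.
            grad = 1.0 - pref_prob(phi, winner, loser)
            for s in winner:
                phi[s] += lr * grad
            for s in loser:
                phi[s] -= lr * grad
    return phi


# Hypothetical annotations: the annotator prefers trajectories reaching a key.
toy_preferences = [
    (["start", "near_key", "key"], ["start", "wall", "wall"]),
    (["start", "key"], ["start", "lava"]),
]
learned_phi = fit_potential(toy_preferences)
assert learned_phi["key"] > learned_phi["wall"]
```

States that appear equally often in the preferred and rejected trajectory (here, `start`) receive cancelling updates, so the learned potential concentrates on the states that distinguish the two.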