Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning

Siddhant Bhambri, Amrita Bhattacharjee, Durgesh Kalwar, Lin Guan, Huan Liu, Subbarao Kambhampati

TL;DR

This work uses off-the-shelf LLMs to generate a plan for an abstraction of the underlying MDP, and shows a significant improvement in the sample efficiency of PPO, A2C, and Q-learning when guided by the LLM-generated heuristics.

Abstract

Reinforcement Learning (RL) suffers from sample inefficiency in sparse-reward domains, and the problem is more pronounced when transitions are stochastic. Reward shaping is a well-studied way to improve sample efficiency: it introduces intrinsic rewards that help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function for all desirable states in the Markov Decision Process (MDP) is challenging, even for domain experts. Given that Large Language Models (LLMs) have demonstrated impressive performance across a multitude of natural language tasks, we aim to answer the following question: "Can we obtain heuristics using LLMs for constructing a reward shaping function that can boost an RL agent's sample efficiency?" To this end, we leverage off-the-shelf LLMs to generate a plan for an abstraction of the underlying MDP. We then use this LLM-generated plan as a heuristic to construct the reward shaping signal for the downstream RL agent. By characterizing the type of abstraction based on the MDP horizon length, we analyze the quality of heuristics generated using an LLM, with and without a verifier in the loop. Our experiments across multiple domains with varying horizon lengths and numbers of sub-goals, drawn from the BabyAI environment suite and the Household, Mario, and Minecraft domains, show 1) the advantages and limitations of querying LLMs with and without a verifier to generate a reward shaping heuristic, and 2) a significant improvement in the sample efficiency of PPO, A2C, and Q-learning when guided by the LLM-generated heuristics.
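
To make the idea of turning a plan into a shaping signal concrete, here is a minimal sketch. It assumes a potential-based shaping term, gamma * Phi(s') - Phi(s), where Phi counts how many steps of the guide plan have been completed in order. The abstract does not specify the shaping function actually used, so this is an illustration only; the plan contents and the names `GUIDE_PLAN`, `potential`, and `shaped_reward` are hypothetical.

```python
from typing import Sequence, Set

# Hypothetical guide plan: an ordered list of abstract sub-goals, such as an
# LLM might return for a key-and-door task. Not taken from the paper.
GUIDE_PLAN: Sequence[str] = ("pickup_key", "open_door", "reach_goal")


def potential(achieved: Set[str], plan: Sequence[str] = GUIDE_PLAN) -> float:
    """Return the number of plan steps completed in order (longest prefix)."""
    done = 0
    for step in plan:
        if step not in achieved:
            break  # later steps do not count until this one is achieved
        done += 1
    return float(done)


def shaped_reward(
    env_reward: float,
    achieved_before: Set[str],
    achieved_after: Set[str],
    gamma: float = 0.99,
) -> float:
    """Add the assumed potential-based bonus gamma*Phi(s') - Phi(s)."""
    bonus = gamma * potential(achieved_after) - potential(achieved_before)
    return env_reward + bonus
```

Under this assumption, a transition that completes the next sub-goal in the plan receives a positive bonus even when the environment reward is zero, while achieving a later sub-goal out of order leaves the potential unchanged.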

Paper Structure (46 sections, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: (I) We use the verifier-augmented LLM to generate a valid (guide) plan for the relaxed search problem. (II) We construct the reward shaping function using the guide plan to add intrinsic rewards by updating the RL agent's replay buffer. (III) Using these intrinsic rewards, the RL agent learns an optimal policy for the underlying stochastic sparse-reward MDP.
  • Figure 2: Visualizations of the BabyAI suite and the Household, Mario, and Minecraft environments.
  • Figure 3: RQ2.1 Results: Smoothed learning curves, measured in episodic returns, comparing vanilla PPO (top) and vanilla A2C (bottom) against the same algorithms with reward shaping from an LLM-generated partial plan and from three variations of LLM-generated complete plans. The solid lines and shaded regions represent the mean and standard deviation across five runs, respectively.
  • Figure 4: RQ2.2 Results: Smoothed learning curves, measured in episodic returns, comparing baseline Q-learning against Q-learning with reward shaping from an LLM-generated partial plan and from an LLM-generated complete plan. The solid lines and shaded regions represent the mean and standard deviation across five runs, respectively.
  • Figure 5: Environment Layouts from the BabyAI environment suite used for experiments.
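
The learning curves in Figures 3 and 4 report a mean and standard deviation across five runs after smoothing. The sketch below shows one way such a summary could be computed. The captions do not state the smoothing method or whether the population or sample standard deviation is used, so the trailing moving average, the window size, and the population standard deviation here are all assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(mean(values[start:i + 1]))
    return smoothed


def summarize_runs(
    runs: Sequence[Sequence[float]], window: int = 3
) -> Tuple[List[float], List[float]]:
    """Smooth each run, then take the per-episode mean and std across runs."""
    smoothed = [moving_average(run, window) for run in runs]
    per_episode = list(zip(*smoothed))  # regroup values by episode index
    means = [mean(episode) for episode in per_episode]
    stds = [pstdev(episode) for episode in per_episode]  # population std
    return means, stds
```

The returned lists correspond to the solid line (mean) and the half-width of the shaded band (standard deviation) at each episode index.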