Words as Beacons: Guiding RL Agents with High-Level Language Prompts

Unai Ruiz-Gonzalez, Alain Andres, Pedro G. Bascoy, Javier Del Ser

TL;DR

This work tackles sparse reward reinforcement learning by using pretrained LLMs as teacher agents that generate a curriculum of subgoals for the learner. The framework introduces a goal-conditioned policy $\pi(a_t|o_t,g_n)$ and a subgoal reward $r_t^g$ scaled by $\alpha$, combined with an intrinsic horizon normalization, to guide exploration through subgoals $g_0,...,g_N$ across three representations: positional, representation-based, and language embeddings. An offline subgoal modeling strategy reduces the need for continual LLM queries during training, enabling efficient curriculum learning over the environment distribution. Empirical results on MiniGrid show that representation-based subgoals with a well-tuned reward balance yield substantial speedups (up to $200\times$ fewer training steps) and robust improvement across diverse tasks, with Llama often outperforming alternative LLMs. The work demonstrates the practical potential of LLM-guided curricula to enhance sample efficiency in sparse RL, while outlining limitations and directions for broader generalization and deployment-ready filtering of language outputs.
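The interaction between the subgoal sequence $g_0,...,g_N$ and the $\alpha$-scaled subgoal reward can be illustrated with a minimal sketch. It is not the authors' implementation: the `SubgoalTracker` class, the use of grid positions as subgoals, the unit bonus per reached subgoal, and the additive form `env_reward + alpha * bonus` are all assumptions made for illustration, and the horizon normalization mentioned above is deliberately left out because its exact form is not given here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Position = Tuple[int, int]


@dataclass
class SubgoalTracker:
    """Hypothetical helper: walks through subgoals g_0..g_N in order."""

    subgoals: List[Position]   # assumed here to be grid positions
    alpha: float = 0.2         # assumed scale of the subgoal reward
    index: int = 0             # n, the index of the active subgoal g_n

    def current(self) -> Optional[Position]:
        """Return the active subgoal g_n, or None once all are reached."""
        if self.index < len(self.subgoals):
            return self.subgoals[self.index]
        return None

    def shaped_reward(self, agent_pos: Position, env_reward: float) -> float:
        """Add alpha * r_g to the environment reward (assumed additive form).

        r_g is taken to be 1 on the step the active subgoal is reached
        and 0 otherwise; reaching it advances to the next subgoal.
        """
        bonus = 0.0
        goal = self.current()
        if goal is not None and agent_pos == goal:
            bonus = 1.0
            self.index += 1
        return env_reward + self.alpha * bonus
```

In a training loop, the value returned by `current()` would be fed to the goal-conditioned policy alongside the observation, and `shaped_reward` would replace the raw sparse reward when updating the learner.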

Abstract

Sparse reward environments in reinforcement learning (RL) pose significant challenges for exploration, often leading to inefficient or incomplete learning. To tackle this issue, this work proposes a teacher-student RL framework that leverages Large Language Models (LLMs) as "teachers" that guide the agent's learning process by decomposing complex tasks into subgoals. Because LLMs can interpret an RL environment from a textual description of its structure and purpose, they can propose subgoals for the environment's task much as a human would. Three types of subgoals are proposed: positional targets relative to the agent, object representations, and language-based instructions generated directly by the LLM. More importantly, we show that it is possible to query the LLM only during the training phase, enabling agents to operate within the environment without any LLM intervention. We assess the proposed framework by eliciting subgoals from three state-of-the-art open-source LLMs (Llama, DeepSeek, Qwen) across various procedurally generated environments of the MiniGrid benchmark. Experimental results demonstrate that this curriculum-based approach accelerates learning and enhances exploration in complex tasks, converging between 30 and 200 times faster, measured in training steps, than recent baselines designed for sparse reward environments.

Paper Structure

This paper contains 24 sections, 1 equation, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: RL framework with teacher LLM. In this framework, $g_0,...,g_N$ are the subgoals provided by the LLM at the beginning of each episode according to the initial state information $s_0$.
  • Figure 2: An episode belonging to the environment KeyCorridorS3R3 from MiniGrid, where the possible subgoals subject to the error are shown. The first subgoal is the key located at the bottom left position of the grid. (left) Possible subgoals for representation-based subgoals; (right) the possible grid locations according to a Manhattan distance of 2.
  • Figure 3: Comparison of the training curves for the proposed CL-based LLM-assisted RL training framework, showing the average return over training steps for the six environments using three methodologies: relative to the agent's position (first column), representation-based (second column) and language-based (third column). The analysis includes three LLMs—Qwen, Llama, and DeepSeek—against the Oracle subgoals. The shaded area represents the variability in the average return across $5$ runs of the agent's training process.
  • Figure 4: Performance comparison for different training scenarios using Llama and representation-based subgoals. The figure displays results for no-reward training (first and third columns) and no-subgoal training (second and fourth columns). The no-reward training condition shows the agent's performance when rewards are absent but representation subgoals are present, while the no-subgoal training condition shows the performance when subgoals are absent but rewards are still present.
  • Figure 5: Initial prompt to the LLM about its knowledge of MiniGrid, establishing a foundational understanding of the environment.
  • ...and 5 more figures
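
The right panel of Figure 2 refers to grid locations at a Manhattan distance of 2 from a subgoal. As a rough illustration of what such a set looks like, the sketch below enumerates the in-bounds cells within a given Manhattan radius of a target. The function name `cells_within_manhattan`, the `(x, y)` coordinate convention, and the reading of the caption as "within" rather than "exactly at" distance 2 are assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Position = Tuple[int, int]


def cells_within_manhattan(
    target: Position, radius: int, width: int, height: int
) -> List[Position]:
    """List grid cells whose Manhattan distance to `target` is <= radius.

    Cells outside the width x height grid are skipped. Walls and other
    obstacles are ignored in this sketch.
    """
    tx, ty = target
    cells = []
    for x in range(max(0, tx - radius), min(width, tx + radius + 1)):
        for y in range(max(0, ty - radius), min(height, ty + radius + 1)):
            if abs(x - tx) + abs(y - ty) <= radius:
                cells.append((x, y))
    return cells
```

Away from the grid border, a radius of 2 yields a diamond of 13 cells around the target; near a corner the set shrinks because out-of-bounds cells are dropped.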