LLM-Guided Reinforcement Learning: Addressing Training Bottlenecks through Policy Modulation

Heng Tan, Hua Yan, Yu Yang

TL;DR

The paper introduces ULTRA, a framework that uses large language models to address reinforcement learning training bottlenecks by identifying critical states from trajectories of a suboptimal policy and guiding policy refinement through LLM-suggested actions and explanation-based rewards. This approach avoids additional model retraining or human input, instead leveraging the LLM's in-context reasoning and case-based analysis to shape policy updates. Empirical results on Pong and MuJoCo benchmarks show ULTRA variants outperform state-of-the-art baselines, with ULTRA-RA providing the strongest gains by combining action corrections and reward shaping. Overall, the work demonstrates the feasibility and benefits of explanation-driven LLM guidance for accelerating RL training in both sparse and dense reward settings.

Abstract

While reinforcement learning (RL) has achieved notable success in various domains, training effective policies for complex tasks remains challenging. Agents often converge to local optima and fail to maximize long-term rewards. Existing approaches to mitigate training bottlenecks typically fall into two categories: (i) Automated policy refinement, which identifies critical states from past trajectories to guide policy updates, but suffers from costly and uncertain model training; and (ii) Human-in-the-loop refinement, where human feedback is used to correct agent behavior, but this does not scale well to environments with large or continuous action spaces. In this work, we design a large language model-guided policy modulation framework that leverages LLMs to improve RL training without additional model training or human intervention. We first prompt an LLM to identify critical states from a sub-optimal agent's trajectories. Based on these states, the LLM then provides action suggestions and assigns implicit rewards to guide policy refinement. Experiments across standard RL benchmarks demonstrate that our method outperforms state-of-the-art baselines, highlighting the effectiveness of LLM-based explanations in addressing RL training bottlenecks.

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: An overview of our framework. (i) Identification: we collect trajectories from a suboptimal RL policy, convert them into natural language with environment context, and prompt an LLM to identify critical states in each episode. (ii) Improvement: after critical states are identified, the agent follows its original policy at non-critical states, while at critical states, it adopts the actions suggested by the LLM and receives the corresponding LLM-generated rewards. The trajectories generated after (i) and (ii) serve as training data for further policy updates.
  • Figure 2: A simplified version of the prompt for identifying critical states in the Pong environment
  • Figure 3: An example of case analysis
  • Figure 4: The identifications and action suggestions in three timesteps
  • Figure 5: The LLM-generated rewards in three timesteps
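
The improvement step described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Guidance`, `choose_action`, and `choose_reward` are hypothetical, actions are assumed to be discrete integers, and the LLM output is assumed to have been parsed already into a per-timestep lookup. The sketch also assumes the LLM-generated reward replaces the environment reward at critical states; the caption does not say whether it replaces or augments it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Tuple[float, ...]  # assumed state representation, for illustration only


@dataclass
class Guidance:
    """Hypothetical container for parsed LLM output at one critical timestep."""
    action: int    # action suggested by the LLM
    reward: float  # reward assigned by the LLM


def choose_action(
    t: int,
    state: State,
    policy: Callable[[State], int],
    guidance: Dict[int, Guidance],
) -> int:
    """Use the LLM-suggested action at critical timesteps, else the original policy."""
    if t in guidance:
        return guidance[t].action
    return policy(state)


def choose_reward(t: int, env_reward: float, guidance: Dict[int, Guidance]) -> float:
    """Use the LLM-generated reward at critical timesteps (assumption: it replaces
    the environment reward), else keep the environment reward unchanged."""
    if t in guidance:
        return guidance[t].reward
    return env_reward


if __name__ == "__main__":
    # Toy policy that always picks action 0; timestep 1 is marked critical.
    policy = lambda s: 0
    guidance = {1: Guidance(action=2, reward=1.0)}
    for t in range(3):
        a = choose_action(t, (0.0,), policy, guidance)
        r = choose_reward(t, -0.5, guidance)
        print(t, a, r)
```

In an actual training loop, these two functions would be called once per environment step, and the resulting (state, action, reward) transitions would be collected as training data for the next policy update.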