Option Discovery Using LLM-guided Semantic Hierarchical Reinforcement Learning

Chak Lam Shek, Pratap Tokekar

TL;DR

The paper addresses poor exploration and weak generalization in long-horizon robotic tasks by integrating large language model reasoning into a semantic hierarchical RL framework, LDSC. It introduces a three-level policy stack (subgoal, option, action) in which an LLM generates subgoals and the agent builds reusable option trees, enabling efficient planning and skill transfer across tasks. Experiments in MuJoCo show gains over the baseline in average reward (≈55.9%), completion time (≈53.1% faster), and success rate (≈72.7% higher) without extra training time, in multi-task environments. The work highlights the potential of combining semantic reasoning with hierarchical control to produce scalable, transferable policies for complex robotics tasks.

Abstract

Large Language Models (LLMs) have shown remarkable promise in reasoning and decision-making, yet their integration with Reinforcement Learning (RL) for complex robotic tasks remains underexplored. In this paper, we propose an LLM-guided hierarchical RL framework, termed LDSC, that leverages LLM-driven subgoal selection and option reuse to enhance sample efficiency, generalization, and multi-task adaptability. Traditional RL methods often suffer from inefficient exploration and high computational cost. Hierarchical RL helps with these challenges, but existing methods often fail to reuse options effectively when faced with new tasks. To address these limitations, we introduce a three-stage framework that uses LLMs for subgoal generation given a natural language description of the task, a reusable option learning and selection method, and an action-level policy, enabling more effective decision-making across diverse tasks. By incorporating LLMs for subgoal prediction and policy guidance, our approach improves exploration efficiency and enhances learning performance. On average, LDSC outperforms the baseline by 55.9% in average reward, demonstrating its effectiveness in complex RL settings. More details and experiment videos can be found at https://raaslab.org/projects/LDSC/.
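
The three-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Option` class, the `move_toward` controller, the `run_episode` loop, and the subgoal coordinates are all hypothetical, and the subgoal list is hard-coded where LDSC would obtain it from an LLM. The sketch only shows how a subgoal sequence, an option selector, and an action-level policy could be nested.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class Option:
    """Hypothetical option: a policy tied to one subgoal."""
    subgoal: Point                      # subgoal this option drives toward
    can_start: Callable[[Point], bool]  # initiation condition
    policy: Callable[[Point], Point]    # action-level policy: state -> (dx, dy)


def reached(state: Point, target: Point, tol: float = 1e-6) -> bool:
    return abs(state[0] - target[0]) <= tol and abs(state[1] - target[1]) <= tol


def move_toward(target: Point, step: float = 0.5) -> Callable[[Point], Point]:
    """Stand-in for a learned action policy: a clipped step toward the target."""
    def policy(state: Point) -> Point:
        dx = max(-step, min(step, target[0] - state[0]))
        dy = max(-step, min(step, target[1] - state[1]))
        return (dx, dy)
    return policy


def run_episode(start: Point, subgoals: Sequence[Point],
                options: Sequence[Option], max_steps: int = 50):
    state, completed = start, []
    for subgoal in subgoals:                      # level 1: subgoal sequence
        option = next((o for o in options         # level 2: option selection
                       if o.subgoal == subgoal and o.can_start(state)), None)
        if option is None:
            break
        for _ in range(max_steps):                # level 3: primitive actions
            if reached(state, subgoal):
                break
            dx, dy = option.policy(state)
            state = (state[0] + dx, state[1] + dy)
        if not reached(state, subgoal):
            break
        completed.append(subgoal)
    return state, completed


# Assumed subgoal coordinates; in LDSC these would be produced by the LLM.
subgoals: List[Point] = [(2.0, 2.0), (2.0, 0.0), (0.0, 2.0)]
options = [Option(sg, lambda s: True, move_toward(sg)) for sg in subgoals]
final_state, completed = run_episode((0.0, 0.0), subgoals, options)
```

In this toy setting every option can start anywhere; a learned initiation condition would restrict `can_start` to the states from which the option has been observed to succeed.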

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of HRL and LDSC in terms of exploration, generalization, and usability. (a) Inefficient Exploration: LDSC improves exploration by leveraging subgoals to guide the agent efficiently, reducing the need for exhaustive search. (b) Lack of Option Generalization: Unlike HRL, which learns state-oriented options that do not generalize well, LDSC learns goal-oriented options that can be reused across different tasks. (c) Limited Usability of Options: LDSC structures options around subgoals, enhancing their reusability for multiple tasks, improving learning efficiency, and enabling better skill transfer.
  • Figure 2: Overview of the LDSC framework. The process consists of two phases: Before Training and During Training. In the Before Training phase, an LLM generates subgoal sequences based on task descriptions and initial conditions. During Training, a hierarchical structure operates with three components: (1) the subgoal policy, which constructs a subgoal relation tree and selects subgoals; (2) the option policy, which builds an option tree and determines the best option; and (3) the action policy, which selects the final action based on the chosen option.
  • Figure 3: Structured prompt design for generating subgoal sequences in a robotic planning task. The figure outlines the sections of the prompt: task description, state representation, goal and subgoal sequencing, and in-context examples. The example shown is a qualitative one from the Point Maze environment, demonstrating how a robot navigates the maze by reasoning over object relationships and environmental constraints to generate feasible and logically ordered action sequences.
  • Figure 4: Qualitative performance of the robot in the Point Maze environment. The upper row shows the initiation set of each option, that is, the region of the state space from which the option can be executed. The lower row displays the corresponding policy plots, where orange regions indicate states in which the policy continues execution and green regions indicate termination states. The robot follows a structured sequence: first reaching subgoal 1 (top-right), then subgoal 2 (bottom-right), and finally the goal (top-left).
  • Figure 5: Trajectory of the quantitative example, with the path colored to indicate changes between options. Each segment of the trajectory represents the agent's movement in 2D space while executing a specific option.
  • ...and 1 more figure
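
The option reuse contrasted with HRL in the Figure 1 caption can be sketched as a cache keyed by subgoal. This is an illustrative assumption rather than the paper's method: `OptionLibrary`, `get_or_train`, and the second task are hypothetical, and the "training" step is a placeholder string. The subgoal names for the first task follow the sequence given in the Figure 4 caption.

```python
from typing import Dict, List, Sequence


class OptionLibrary:
    """Hypothetical cache holding one goal-oriented option per subgoal."""

    def __init__(self) -> None:
        self._options: Dict[str, str] = {}
        self.trained: List[str] = []  # subgoals that required new learning

    def get_or_train(self, subgoal: str) -> str:
        if subgoal not in self._options:
            # Placeholder for option learning; a real system would train a policy here.
            self._options[subgoal] = f"option->{subgoal}"
            self.trained.append(subgoal)
        return self._options[subgoal]


def plan(library: OptionLibrary, subgoals: Sequence[str]) -> List[str]:
    """Map a subgoal sequence to options, reusing cached ones where possible."""
    return [library.get_or_train(sg) for sg in subgoals]


library = OptionLibrary()
task_a = ["top_right", "bottom_right", "top_left"]  # sequence from the Figure 4 caption
task_b = ["top_right", "bottom_left"]               # assumed second task
plan_a = plan(library, task_a)
plan_b = plan(library, task_b)
```

Because options are indexed by the subgoal they reach rather than by the states visited in one task, the second task only needs one new option (`bottom_left`) and reuses the existing `top_right` option.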