Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, Yu Cheng

TL;DR

Long-horizon decision-making with large language models (LLMs) suffers from poor exploration and credit assignment under sparse rewards. The paper introduces GLIDER, an offline hierarchical RL framework that grounds LLM policies with a two-level architecture (high-level planning and low-level execution), trained via behavior cloning followed by offline hierarchical policy refinement and optional offline-to-online adaptation. Key contributions include a parameter-efficient shared-backbone design with LoRA, a principled offline training objective at both the high- and low-levels with intrinsic subtask rewards, and strong empirical results on ScienceWorld and ALFWorld that show improved performance and generalization, including rapid online adaptation. The work demonstrates that structured, semantically grounded hierarchies enable efficient exploration and transfer, offering practical benefits for robust LLM-based agents in changing environments.

Abstract

While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework **GLIDER** (**G**rounding **L**anguage Models as Eff**I**cient **D**ecision-Making Agents via Offline Hi**E**rarchical **R**einforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.

Paper Structure

This paper contains 17 sections, 7 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: GLIDER's hierarchical framework, showing significant performance gain over non-hierarchical approaches.
  • Figure 2: Overview of the GLIDER framework. (a) Hierarchical Actor-Critic architecture with prompt-controlled high- and low-level training on trajectories sampled from offline datasets. (b) Hierarchical policy structure in which the high-level policy $\pi^h$ generates a new sub-task $g$ only after the low-level policy $\pi^l$ has executed primitive actions for $c$ steps. The high-level policy provides the low-level policy with an intrinsic reward $\hat{r}$ that indicates sub-task completion, and collects the environment rewards accumulated over those $c$ timesteps as a single reward $R_t = \sum_{i=t}^{t+c-1} r_i$. (c) The training pipeline comprises SFT, ORL (offline RL), and O2O (offline-to-online RL) stages. (d) Structured hierarchical trajectories composed of high-level transitions $(d; o_t, g_t, R_t, o_{t+c})$ and low-level transitions $(g; o_t, a_t, \hat{r}_t, o_{t+1})$.
  • Figure 3: Ablation performance on unseen tasks in ScienceWorld across model architectures. Solid bars denote hierarchical models and shaded bars denote the same models with the hierarchy ablated. The purple/yellow/green bars correspond to the SFT/ORL/SFT+ORL training stages, respectively.
  • Figure 4: Online fine-tuning performance (score/100) of GLIDER against AC and AWAC baselines in ScienceWorld.
  • Figure 5: Performance on unseen tasks in ScienceWorld under different expert-to-medium data mixture ratios in the offline RL stage, using Llama-3-8B as the LLM backbone.
  • ...and 4 more figures
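
The trajectory structure described in the Figure 2 caption can be illustrated with a small sketch. The Python below is not the authors' code. It is a minimal example, under stated assumptions, of splitting one flat episode into high-level transitions $(d; o_t, g_t, R_t, o_{t+c})$ and low-level transitions $(g; o_t, a_t, \hat{r}_t, o_{t+1})$. The names `HighLevelTransition`, `LowLevelTransition`, and `build_hierarchical_transitions` are hypothetical, as is the `completed` flag list. The intrinsic reward rule used here (1.0 on the last step of a segment whose sub-task was completed, 0.0 otherwise) is an assumption made for illustration; the caption only says that $\hat{r}$ indicates sub-task completion.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HighLevelTransition:
    task: str        # task description d
    obs: str         # o_t, observation at the start of the segment
    subtask: str     # g_t, sub-task issued by the high-level policy
    reward: float    # R_t, sum of environment rewards over the segment
    next_obs: str    # o_{t+c}, observation at the end of the segment


@dataclass
class LowLevelTransition:
    subtask: str             # g, sub-task the low-level policy is conditioned on
    obs: str                 # o_t
    action: str              # a_t, primitive action
    intrinsic_reward: float  # r_hat_t, sub-task completion signal
    next_obs: str            # o_{t+1}


def build_hierarchical_transitions(
    task: str,
    observations: List[str],
    actions: List[str],
    rewards: List[float],
    subtasks: List[str],
    completed: List[bool],
    c: int,
) -> Tuple[List[HighLevelTransition], List[LowLevelTransition]]:
    """Split a flat episode into segments of at most c steps.

    `observations` holds one more entry than `actions`; `subtasks` and
    `completed` hold one entry per segment.
    """
    if len(observations) != len(actions) + 1 or len(rewards) != len(actions):
        raise ValueError("inconsistent episode lengths")

    high: List[HighLevelTransition] = []
    low: List[LowLevelTransition] = []
    for k, start in enumerate(range(0, len(actions), c)):
        end = min(start + c, len(actions))  # the final segment may be shorter
        g = subtasks[k]
        # High level: one transition per segment, with summed environment reward.
        high.append(
            HighLevelTransition(
                task, observations[start], g,
                sum(rewards[start:end]), observations[end],
            )
        )
        # Low level: one transition per primitive action.
        for t in range(start, end):
            # Assumed rule: reward only the step that closes a completed sub-task.
            r_hat = 1.0 if (t == end - 1 and completed[k]) else 0.0
            low.append(
                LowLevelTransition(g, observations[t], actions[t], r_hat, observations[t + 1])
            )
    return high, low
```

With $c = 2$ and a four-step episode, this yields two high-level transitions and four low-level transitions. The environment reward reaches only the high-level transitions, while the low-level transitions carry only the intrinsic signal, which matches the separation of rewards described in the caption.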