Dual-Scale World Models for LLM Agents Towards Hard-Exploration Problems

Minsoo Kim, Seung-won Hwang

TL;DR

The paper tackles hard-exploration for LLM agents by introducing GLoW, a dual-scale world-model framework that combines a global trajectory frontier for principled state selection with a local exploration strategy based on Multi-path Advantage Reflection (MAR). The global model extracts high-value patterns across discovered trajectories, while MAR densifies sparse rewards through semantic advantages at critical decision points, guided by an LLM-enabled policy. On the Jericho benchmark, GLoW achieves state-of-the-art results among LLM-based approaches and rivals RL-based methods while dramatically reducing environment interactions, demonstrating strong sample efficiency and robust exploration. This work highlights the value of coupling long-horizon, frontier-driven learning with local, advantage-based exploration signals to overcome sparse rewards in complex text-based environments.

Abstract

LLM-based agents have seen promising advances, yet they are still limited in "hard-exploration" tasks requiring learning new knowledge through exploration. We present GLoW, a novel approach leveraging dual-scale world models, maintaining a trajectory frontier of high-value discoveries at the global scale, while learning from local trial-and-error in exploration through a Multi-path Advantage Reflection mechanism which infers advantage-based progress signals to guide exploration. To evaluate our framework for hard-exploration, we tackle the Jericho benchmark suite of text-based games, where GLoW achieves a new state-of-the-art performance for LLM-based approaches. Compared to state-of-the-art RL-based methods, our approach achieves comparable performance while requiring 100-800x fewer environment interactions.
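
To make the two scales concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the global "trajectory frontier" can be approximated by a top-k buffer of trajectories ranked by game score, and it swaps in a plain numeric advantage (return minus the mean return over sibling paths) for the LLM-inferred semantic advantage that the abstract describes. The names `Trajectory`, `Frontier`, and `path_advantages` are hypothetical and do not come from the paper.

```python
import heapq
from dataclasses import dataclass, field
from statistics import mean


@dataclass(order=True)
class Trajectory:
    # Ordering uses the score only; the action list is ignored in comparisons.
    score: float
    actions: list = field(compare=False, default_factory=list)


class Frontier:
    """Hypothetical global-scale buffer: keeps the k highest-scoring trajectories."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap, so the weakest kept trajectory sits at index 0

    def add(self, traj):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, traj)
        elif traj.score > self._heap[0].score:
            # Replace the weakest kept trajectory with the new, better one.
            heapq.heapreplace(self._heap, traj)

    def best(self):
        # Trajectories from highest to lowest score.
        return sorted(self._heap, reverse=True)


def path_advantages(returns):
    """Numeric stand-in for a local advantage signal (an assumption, not MAR itself).

    `returns` maps each action tried from the same state to the return observed
    along that path. The advantage is the return minus the mean over all paths.
    """
    baseline = mean(returns.values())
    return {action: ret - baseline for action, ret in returns.items()}


frontier = Frontier(k=2)
frontier.add(Trajectory(10.0, ["open mailbox"]))
frontier.add(Trajectory(0.0, ["wait"]))
frontier.add(Trajectory(25.0, ["open mailbox", "read leaflet"]))
top_scores = [t.score for t in frontier.best()]

advantages = path_advantages({"go north": 5.0, "take lamp": 10.0, "wait": 0.0})
```

In this toy version, a sparse end-of-episode score is turned into a per-action signal by comparing several paths that branch from the same state, which is the general idea behind an advantage-based progress signal. How GLoW actually selects frontier states and infers advantages with an LLM is specified in the paper's method sections, not here.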

Paper Structure

This paper contains 28 sections, 10 equations, 2 figures, 4 tables, 3 algorithms.

Figures (2)

  • Figure 1: (a) Select procedure in GLoW, (b) Illustration of selection with Global World Model
  • Figure 2: (a) Explore procedure in GLoW, (b) Illustration of exploration with Local World Model