World Models with Hints of Large Language Models for Goal Achieving

Zeyuan Liu; Ziyu Huan; Xiyao Wang; Jiafei Lyu; Jian Tao; Xiu Li; Furong Huang; Huazhe Xu

TL;DR

The paper tackles long-horizon reinforcement learning with sparse rewards by introducing DLLM, a multi-modal model-based framework that uses large language models to generate goal descriptions and intrinsic rewards during world-model rollouts. DLLM grounds goals in observations via SentenceBERT embeddings and uses a cosine-similarity mechanism to reward transitions aligned with these goals, while a novelty-based RND component prevents repetitive behavior. The world model (RSSM) and actor-critic learner are trained end-to-end with a loss that accounts for perception, transitions, rewards, and prediction quality, integrating language-driven guidance into planning. Empirical results across HomeGrid, Crafter, and Minecraft show that DLLM outperforms several strong baselines, with larger gains when using stronger LLMs, highlighting the practical value of language-informed exploration and planning for complex, sparse-reward tasks.
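The cosine-similarity reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `threshold` and `scale` parameters, and the count-based decay are all assumptions made for illustration, and the toy vectors stand in for real sentence embeddings. In particular, the decay is a simple stand-in for discouraging repeated goals, not the RND component the paper uses.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def goal_intrinsic_rewards(transition_emb, goal_embs, threshold=0.5,
                           scale=1.0, hit_counts=None):
    """Hypothetical helper: one intrinsic reward i_k per goal k.

    A goal is rewarded only when the transition embedding is similar enough
    to the goal embedding (assumed threshold), and the reward shrinks each
    time the same goal is matched again (assumed 1 / (1 + count) decay).
    """
    if hit_counts is None:
        hit_counts = np.zeros(len(goal_embs), dtype=int)
    rewards = np.zeros(len(goal_embs))
    for k, goal_emb in enumerate(goal_embs):
        sim = cosine_similarity(transition_emb, goal_emb)
        if sim >= threshold:
            rewards[k] = scale * sim / (1 + hit_counts[k])
            hit_counts[k] += 1  # remember that goal k was already matched
    return rewards, hit_counts

# Toy 2-D "embeddings": the transition matches goal 0 but not goal 1.
goals = np.array([[1.0, 0.0], [0.0, 1.0]])
rewards, counts = goal_intrinsic_rewards(np.array([1.0, 0.0]), goals)
```

Passing `counts` back in as `hit_counts` on the next call yields a smaller reward for a goal that has already been matched, which is the behavior the decay is meant to show.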

Abstract

Reinforcement learning struggles in the face of long-horizon tasks and sparse goals due to the difficulty in manual reward specification. While existing methods address this by adding intrinsic rewards, they may fail to provide meaningful guidance in long-horizon decision-making tasks with large state and action spaces, lacking purposeful exploration. Inspired by human cognition, we propose a new multi-modal model-based RL approach named Dreaming with Large Language Models (DLLM). DLLM integrates the proposed hinting subgoals from the LLMs into the model rollouts to encourage goal discovery and reaching in challenging tasks. By assigning higher intrinsic rewards to samples that align with the hints outlined by the language model during model rollouts, DLLM guides the agent toward meaningful and efficient exploration. Extensive experiments demonstrate that DLLM outperforms recent methods in various challenging, sparse-reward environments such as HomeGrid, Crafter, and Minecraft by 27.7\%, 21.1\%, and 9.9\%, respectively.

Paper Structure (40 sections, 13 equations, 12 figures, 9 tables, 1 algorithm)

Introduction
Background and Related Work
Preliminaries
Dreaming with LLMs
Goal Generation by Prompting LLMs
Incorporating Decreased Intrinsic Rewards into Dreaming Processes
World Model and Actor Critic Learning
Experiments
HomeGrid
Crafter
Minecraft
Conclusion and Discussion
Environment Details
HomeGrid
Details of Environmental Adjustments
...and 25 more sections

Figures (12)

Figure 1: Overall algorithmic structure of DLLM, where WM denotes the world model, $o_l$ represents the natural language caption of the observation, $u$ denotes the transition, and $i_k$ corresponds to the intrinsic reward for the $k$-th goal.
Figure 2: HomeGrid experiment results. Curves are averaged over 5 seeds, with shading representing one-eighth of the standard deviation.
Figure 3: Left. Bar chart comparing the means and standard deviations of DLLM and the baselines. DLLM generally exhibits higher average performance, surpassing baselines by a large margin. Right. Success rates, on a logarithmic scale, for unlocking 18 of the 22 achievements at 1M (the remaining four are never achieved). DLLM surpasses baselines in most achievements, particularly excelling in challenging tasks such as "make stone pickaxe/sword" and "collect iron". "AD" refers to Achievement Distillation [moon2024discovering]; we use its official code base to obtain success rate results.
Figure 4: The episode returns in Minecraft Diamond. The curves indicate that DLLM enjoys a consistent advantage throughout the entire learning process, thanks to its utilization of an LLM for exploration and training. All algorithms undergo experiments using 5 different seeds.
Figure 5: Illustration of the difference between our adapted environment map (b) and the original environment map (a). We generally use a smaller map but ensure that the characteristics of the underlying environment remain unchanged. When and only when the robot succeeds in opening any bin, there will be an icon of the action the robot takes at the lower right of the pixel observation, as shown in (c).
...and 7 more figures
